same. Two words can be homonyms in a different way if they are spelled differently
but pronounced the same, like write and right, or piece and peace. We call these
homophones; they are one cause of real-word spelling errors.
Homonymy causes problems in other areas of language processing as well. In
question answering or information retrieval, we can better help a user who typed "bat
care" if we know whether they are interested in vampires or just want to play baseball.
The two senses will also have different translations; in Spanish the animal bat is a murciélago
while the baseball bat is a bate. Homographs that are pronounced differently cause
problems for speech synthesis (Chapter 28), such as these homographs of the word
bass, the fish pronounced [b ae s] and the instrument pronounced [b ey s].
(C.3) The expert angler from Dora, Mo., was fly-casting for bass rather than the
traditional trout.
(C.4) The curtain rises to the sound of angry dogs baying and ominous bass chords
sounding.
Sometimes there is also some semantic connection between the senses of a word.
Consider the following example:
(C.5) While some banks furnish blood only to hospitals, others are less restrictive.
Although this is clearly not a use of the “sloping mound” meaning of bank, it just
as clearly is not a reference to a charitable giveaway by a financial institution. Rather,
bank has a whole range of uses related to repositories for various biological entities,
as in blood bank, egg bank, and sperm bank. So we could call this “biological
repository” sense bank3 . Now this new sense bank3 has some sort of relation to
bank1 ; both bank1 and bank3 are repositories for entities that can be deposited and
taken out; in bank1 the entity is monetary, whereas in bank3 the entity is biological.
When two senses are related semantically, we call the relationship between them
polysemy rather than homonymy. In many cases of polysemy, the semantic relation
between the senses is systematic and structured. For example, consider yet another
sense of bank, exemplified in the following sentence:
(C.6) The bank is on the corner of Nassau and Witherspoon.
This sense, which we can call bank4 , means something like “the building be-
longing to a financial institution”. It turns out that these two kinds of senses (an
organization and the building associated with an organization ) occur together for
many other words as well (school, university, hospital, etc.). Thus, there is a sys-
tematic relationship between senses that we might represent as
BUILDING ↔ ORGANIZATION
This particular subtype of polysemy relation is often called metonymy. Metonymy
is the use of one aspect of a concept or entity to refer to other aspects of the entity
or to the entity itself. Thus, we are performing metonymy when we use the phrase
the White House to refer to the administration whose office is in the White House.
Other common examples of metonymy include the relation between the following
pairings of senses:
Author (Jane Austen wrote Emma) ↔ Works of Author (I really love Jane Austen)
Tree (Plums have beautiful blossoms) ↔ Fruit (I ate a preserved plum yesterday)
While it can be useful to distinguish polysemy from unrelated homonymy, there
is no hard threshold for how related two senses must be to be considered polyse-
mous. Thus, the difference is really one of degree. This fact can make it very difficult
to decide how many senses a word has, that is, whether to make separate senses for
closely related usages. There are various criteria for deciding that the differing uses
of a word should be represented with discrete senses. We might consider two senses
discrete if they have independent truth conditions, different syntactic behavior, and
independent sense relations, or if they exhibit antagonistic meanings.
Consider the following uses of the verb serve from the WSJ corpus:
(C.7) They rarely serve red meat, preferring to prepare seafood.
(C.8) He served as U.S. ambassador to Norway in 1976 and 1977.
(C.9) He might have served his time, come out and led an upstanding life.
The serve of serving red meat and that of serving time clearly have different truth
conditions and presuppositions; the serve of serve as ambassador has the distinct
subcategorization structure serve as NP. These heuristics suggest that these are prob-
ably three distinct senses of serve. One practical technique for determining if two
senses are distinct is to conjoin two uses of a word in a single sentence; this kind of
conjunction of antagonistic readings is called zeugma. Consider the following ATIS
examples:
(C.10) Which of those flights serve breakfast?
(C.11) Does Midwest Express serve Philadelphia?
(C.12) ?Does Midwest Express serve breakfast and Philadelphia?
We use (?) to mark those examples that are semantically ill-formed. The oddness of
the invented third example (a case of zeugma) indicates there is no sensible way to
make a single sense of serve work for both breakfast and Philadelphia. We can use
this as evidence that serve has two different senses in this case.
Dictionaries tend to use many fine-grained senses so as to capture subtle meaning
differences, a reasonable approach given that the traditional role of dictionaries is
aiding word learners. For computational purposes, we often don’t need these fine
distinctions, so we may want to group or cluster the senses; we have already done
this for some of the examples in this chapter.
How can we define the meaning of a word sense? We introduced in Chapter 6 the
standard computational approach of representing a word as an embedding, a point in
semantic space. The intuition was that words were defined by their co-occurrences,
the counts of words that often occur nearby.
Thesauri offer an alternative way of defining words. But we can’t just look at
the definition itself. Consider the following fragments from the definitions of right,
left, red, and blood from the American Heritage Dictionary (Morris, 1985).
right adj. located nearer the right hand esp. being on the right when
facing the same direction as the observer.
left adj. located nearer to this side of the body than the right.
red n. the color of blood or a ruby.
blood n. the red liquid that circulates in the heart, arteries and veins of
animals.
Note the circularity in these definitions. The definition of right makes two direct
references to itself, and the entry for left contains an implicit self-reference in the
phrase this side of the body, which presumably means the left side. The entries for
red and blood reference each other in their definitions. Such circularity is inherent
in all dictionary definitions. For humans, such entries are still useful since the user
of the dictionary has sufficient grasp of these other terms.
For computational purposes, one approach to defining a sense is—like the dic-
tionary definitions—defining a sense through its relationship with other senses. For
example, the above definitions make it clear that right and left are similar kinds of
lemmas that stand in some kind of alternation, or opposition, to one another. Simi-
larly, we can glean that red is a color, that it can be applied to both blood and rubies,
and that blood is a liquid. Sense relations of this sort are embodied in on-line
databases like WordNet. Given a sufficiently large database of such relations, many
applications are quite capable of performing sophisticated semantic tasks (even if
they do not really know their right from their left).
Synonymy We introduced in Chapter 6 the idea that when two senses of two dif-
ferent words (lemmas) are identical, or nearly identical, we say the two senses are
synonyms. Synonyms include such pairs as
couch/sofa vomit/throw up filbert/hazelnut car/automobile
Note that in the WordNet entry for the lemma bass there are eight senses for the noun and one for the adjective, each of
which has a gloss (a dictionary-style definition), a list of synonyms for the sense, and
sometimes also usage examples (shown for the adjective sense). Unlike dictionaries,
WordNet doesn’t represent pronunciation, so doesn’t distinguish the pronunciation
[b ae s] in bass4 , bass5 , and bass8 from the other senses pronounced [b ey s].
The set of near-synonyms for a WordNet sense is called a synset (for synonym
set); synsets are an important primitive in WordNet. The entry for bass includes
synsets like {bass1 , deep6 }, or {bass6 , bass voice1 , basso2 }. We can think of a
synset as representing a concept of the type we discussed in Chapter 14. Thus,
instead of representing concepts in logical terms, WordNet represents them as lists
of the word senses that can be used to express the concept. Here’s another synset
example:
{chump1 , fool2 , gull1 , mark9 , patsy1 , fall guy1 ,
sucker1 , soft touch1 , mug2 }
The gloss of this synset describes it as a person who is gullible and easy to take
advantage of. Each of the lexical entries included in the synset can, therefore, be
used to express this concept. Synsets like this one actually constitute the senses
associated with WordNet entries, and hence it is synsets, not wordforms, lemmas, or
individual senses, that participate in most of the lexical sense relations in WordNet.
WordNet represents all the kinds of sense relations discussed in the previous sec-
tion, as illustrated in Fig. C.2 and Fig. C.3. WordNet hyponymy relations correspond
Sense 3
bass, basso --
(an adult male singer with the lowest voice)
=> singer, vocalist, vocalizer, vocaliser
=> musician, instrumentalist, player
=> performer, performing artist
=> entertainer
=> person, individual, someone...
=> organism, being
=> living thing, animate thing,
=> whole, unit
=> object, physical object
=> physical entity
=> entity
=> causal agent, cause, causal agency
=> physical entity
=> entity
Sense 7
bass --
(the member with the lowest range of a family of
musical instruments)
=> musical instrument, instrument
=> device
=> instrumentality, instrumentation
=> artifact, artefact
=> whole, unit
=> object, physical object
=> physical entity
=> entity
Figure C.4 Hyponymy chains for two separate senses of the lemma bass. Note that the
chains are completely distinct, only converging at the very abstract level whole, unit.
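The hyponymy chains above can be reproduced programmatically. Below is a minimal sketch using NLTK's WordNet interface; it assumes the nltk package with its 'wordnet' data installed, and the NLTK sense name 'bass.n.03' is an assumption that may not line up exactly with the sense numbering used in this appendix.

```python
# A minimal sketch using NLTK's WordNet interface (assumes the nltk package and
# the 'wordnet' data, installed e.g. via nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

# List the synsets (senses) of the lemma "bass" with their glosses.
for synset in wn.synsets('bass'):
    print(synset.name(), '-', synset.definition())

# Walk the hypernym chains for one sense, analogous to Fig. C.4; the NLTK
# sense name 'bass.n.03' is an assumption and may denote a different sense
# than the one shown above.
sense = wn.synset('bass.n.03')
for path in sense.hypernym_paths():
    print(' => '.join(s.name() for s in reversed(path)))
```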
We might say that the financial sense is similar to one of the senses of fund and the
riparian sense is more similar to one of the senses of slope. In the next few sections
of this chapter, we will compute these relations over both words and senses.
The thesaurus-based algorithms use the structure of the thesaurus to define word
similarity. In principle, we could measure similarity by using any information avail-
able in a thesaurus (meronymy, glosses, etc.). In practice, however, thesaurus-based
word similarity algorithms generally use only the hypernym/hyponym (is-a or sub-
sumption) hierarchy. In WordNet, verbs and nouns are in separate hypernym hier-
archies, so a thesaurus-based algorithm for WordNet can thus compute only noun-
noun similarity, or verb-verb similarity; we can’t compare nouns to verbs or do
anything with adjectives or other parts of speech.
The simplest thesaurus-based algorithms are based on the intuition that words
or senses are more similar if there is a shorter path between them in the thesaurus
graph, an intuition dating back to Quillian (1969). A word/sense is most similar to
itself, then to its parents or siblings, and least similar to words that are far away. We
make this notion operational by measuring the number of edges between the two
concept nodes in the thesaurus graph and adding one. Figure C.5 shows an intuition;
the concept dime is most similar to nickel and coin, less similar to money, and even
less similar to Richter scale. A formal definition:
pathlen(c1, c2) = 1 + the number of edges in the shortest path in the thesaurus graph between the two concept nodes c1 and c2

Figure C.5 A fragment of the WordNet hypernym hierarchy, showing path lengths (number
of edges plus 1) from nickel to coin (2), dime (3), money (6), and Richter scale (8).
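To make the edge-counting idea concrete, here is a small self-contained sketch that computes pathlen by breadth-first search over a toy graph; the node names and edges are assumptions that only loosely mirror the fragment in Fig. C.5.

```python
# Toy breadth-first-search implementation of pathlen; the graph below is an
# assumption that only loosely mirrors the Fig. C.5 fragment.
from collections import deque

EDGES = [
    ('nickel', 'coin'), ('dime', 'coin'), ('coin', 'coinage'),
    ('coinage', 'currency'), ('currency', 'medium of exchange'),
    ('money', 'medium of exchange'), ('medium of exchange', 'standard'),
    ('Richter scale', 'scale'), ('scale', 'standard'),
]
graph = {}
for a, b in EDGES:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def pathlen(c1, c2):
    """1 + the number of edges on the shortest path between c1 and c2."""
    queue, seen = deque([(c1, 0)]), {c1}
    while queue:
        node, edges = queue.popleft()
        if node == c2:
            return edges + 1
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, edges + 1))
    return None

for target in ('coin', 'dime', 'money', 'Richter scale'):
    # prints 2, 3, 6, 8 -- the values in the Fig. C.5 caption
    print('pathlen(nickel, %s) = %d' % (target, pathlen('nickel', target)))
```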
The basic path-length algorithm makes the implicit assumption that each link
in the network represents a uniform distance. In practice, this assumption is not
appropriate. Some links (e.g., those that are deep in the WordNet hierarchy) often
seem to represent an intuitively narrow distance, while other links (e.g., higher up
in the WordNet hierarchy) represent an intuitively wider distance. For example, in
Fig. C.5, the distance from nickel to money (5) seems intuitively much shorter than
the distance from nickel to an abstract word standard; the link between medium of
exchange and standard seems wider than that between, say, coin and coinage.
It is possible to refine path-based algorithms with normalizations based on depth
in the hierarchy (Wu and Palmer, 1994), but in general we’d like an approach that
lets us independently represent the distance associated with each edge.
A second class of thesaurus-based similarity algorithms attempts to offer just
such a fine-grained metric. These information-content word-similarity algorithms
still rely on the structure of the thesaurus but also add probabilistic information
derived from a corpus.
Following Resnik (1995) we’ll define P(c) as the probability that a randomly
selected word in a corpus is an instance of concept c (i.e., a separate random variable,
ranging over words, associated with each concept). This implies that P(root) = 1
since any word is subsumed by the root concept. Intuitively, the lower a concept
in the hierarchy, the lower its probability. We train these probabilities by counting
in a corpus; each word in the corpus counts as an occurrence of each concept that
contains it. For example, in Fig. C.5 above, an occurrence of the word dime would
count toward the frequency of coin, currency, standard, etc. More formally, Resnik
computes P(c) as follows:
P(c) = ( Σw∈words(c) count(w) ) / N        (C.19)
where words(c) is the set of words subsumed by concept c, and N is the total number
of words in the corpus that are also present in the thesaurus.
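A rough sketch of this counting scheme follows, using NLTK's WordNet hypernym paths to find every concept that contains a word; the toy word counts, the restriction to noun senses, and the helper name concept_counts are illustrative assumptions rather than Resnik's exact procedure.

```python
# Rough sketch of concept probabilities P(c) (Eq. C.19): every occurrence of a
# word counts toward each concept that subsumes one of its senses. The toy
# word counts and the restriction to noun senses are illustrative assumptions.
from collections import Counter
from nltk.corpus import wordnet as wn

def concept_counts(word_counts):
    counts, total = Counter(), 0
    for word, n in word_counts.items():
        synsets = wn.synsets(word, pos=wn.NOUN)
        if not synsets:
            continue                      # word not in the thesaurus
        total += n
        subsumers = set()
        for s in synsets:
            for path in s.hypernym_paths():
                subsumers.update(path)    # the synset itself plus all ancestors
        for concept in subsumers:
            counts[concept] += n
    return counts, total

word_counts = {'dime': 10, 'nickel': 8, 'hill': 3, 'coast': 5}   # toy corpus
counts, N = concept_counts(word_counts)
P = {c: counts[c] / N for c in counts}
print(P[wn.synset('entity.n.01')])    # 1.0: the root subsumes every word
```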
Figure C.6, from Lin (1998), shows a fragment of the WordNet concept hierar-
chy augmented with the probabilities P(c).
entity (0.395) → inanimate-object (0.167) → natural-object (0.0163) → geological-formation (0.00176) → . . .
We now need two additional definitions. First, following basic information the-
ory, we define the information content (IC) of a concept c as

IC(c) = − log P(c)

Second, we define the lowest common subsumer LCS(c1, c2) as the lowest node in the hierarchy
that subsumes (is a hypernym of) both c1 and c2. The Resnik similarity of two concepts
is then just the information content of their lowest common subsumer, − log P(LCS(c1, c2)).
Lin (1998) then measures the similarity of two objects A and B as the ratio between the
information that characterizes their commonality and the information that fully describes
what they are:

simLin(A, B) = common(A, B) / description(A, B)        (C.24)
Applying this idea to the thesaurus domain, Lin shows (in a slight modification
of Resnik’s assumption) that the information in common between two concepts is
twice the information in the lowest common subsumer LCS(c1 , c2 ). Adding in the
above definitions of the information content of thesaurus concepts, the final Lin
Lin similarity similarity function is
simLin(c1, c2) = 2 × log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )        (C.25)
For example, using simLin , Lin (1998) shows that the similarity between the
concepts of hill and coast from Fig. C.6 is
simLin(hill, coast) = 2 × log P(geological-formation) / ( log P(hill) + log P(coast) ) = 0.59        (C.26)
A similar formula, Jiang-Conrath distance (Jiang and Conrath, 1997), although
derived in a completely different way from Lin and expressed as a distance rather
than a similarity function, has been shown to work as well as or better than all the
other thesaurus-based methods:

distJC(c1, c2) = 2 × log P(LCS(c1, c2)) − ( log P(c1) + log P(c2) )
Let RELS be the set of possible WordNet relations whose glosses we compare;
assuming a basic overlap measure as sketched above, we can then define the Extended
Lesk overlap measure as

simeLesk(c1, c2) = Σr,q∈RELS overlap(gloss(r(c1)), gloss(q(c2)))        (C.28)
simpath(c1, c2) = 1 / pathlen(c1, c2)

simResnik(c1, c2) = − log P(LCS(c1, c2))

simLin(c1, c2) = 2 × log P(LCS(c1, c2)) / ( log P(c1) + log P(c2) )

simJC(c1, c2) = 1 / ( 2 × log P(LCS(c1, c2)) − (log P(c1) + log P(c2)) )

simeLesk(c1, c2) = Σr,q∈RELS overlap(gloss(r(c1)), gloss(q(c2)))
Figure C.7 summarizes the five similarity measures we have described in this
section.
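Several of these measures are available in NLTK and can serve as a quick sanity check. The sketch below assumes the nltk package with the 'wordnet' and 'wordnet_ic' data installed; the synset names are assumptions, and the printed values depend on the Brown-corpus information-content file, so the Lin score will not exactly reproduce the 0.59 of Eq. C.26.

```python
# Sanity-check several of the measures with NLTK (assumes the 'wordnet' and
# 'wordnet_ic' data packages). Values depend on the Brown-corpus IC file.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

hill = wn.synset('hill.n.01')     # the sense numbers here are assumptions
coast = wn.synset('coast.n.01')

print('sim_path  :', hill.path_similarity(coast))            # 1 / pathlen
print('sim_resnik:', hill.res_similarity(coast, brown_ic))   # -log P(LCS)
print('sim_lin   :', hill.lin_similarity(coast, brown_ic))
print('sim_jc    :', hill.jcn_similarity(coast, brown_ic))   # 1 / JC distance
```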
It is useful to distinguish two WSD tasks. In the lexical sample task, a small
pre-selected set of target words is chosen, along with an inventory of senses for each
word from some lexicon. Since the set of words and the set of senses are small,
simple supervised classification approaches are used.

2 The WordNet database includes eight senses; we have arbitrarily selected two for this example; we
have also arbitrarily selected one of the many Spanish fishes that could translate English sea bass.
In the all-words task, systems are given entire texts and a lexicon with an inven-
tory of senses for each entry and are required to disambiguate every content word in
the text. The all-words task is similar to part-of-speech tagging, except with a much
larger set of tags since each lemma has its own set. A consequence of this larger set
of tags is data sparseness; it is unlikely that adequate training data for every word in
the test set will be available. Moreover, given the number of polysemous words in
reasonably sized lexicons, approaches based on training one classifier per term are
unlikely to be practical.
C.5.2 Evaluation
To evaluate WSD algorithms, it's better to consider extrinsic, task-based, or end-to-end
evaluation, in which we see whether some new WSD idea actually improves
performance in some end-to-end application like question answering or machine
translation. Nonetheless, because extrinsic evaluations are difficult and slow, WSD
systems are typically evaluated with intrinsic evaluation, in which a WSD compo-
nent is treated as an independent system. Common intrinsic evaluations are either
exact-match sense accuracy—the percentage of words that are tagged identically
with the hand-labeled sense tags in a test set—or precision and recall if sys-
tems are permitted to pass on the labeling of some instances. In general, we evaluate
by using held-out data from the same sense-tagged corpora that we used for training,
such as the SemCor corpus discussed above or the various corpora produced by the
SENSEVAL effort.
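A minimal sketch of these intrinsic metrics follows; the use of None to mark instances the system passes on, and the toy sense tags, are illustrative assumptions.

```python
# Minimal sketch of intrinsic WSD scoring; None marks an instance the system
# declines to label. When the system labels everything, accuracy = precision = recall.
def wsd_scores(gold, predicted):
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in attempted if g == p)
    accuracy = correct / len(gold)                       # exact-match accuracy
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold)
    return accuracy, precision, recall

gold = ['bass.n.07', 'bass.n.03', 'bank.n.01']
pred = ['bass.n.07', None, 'bank.n.02']                  # one abstention, one error
print(wsd_scores(gold, pred))                            # (0.33, 0.5, 0.33)
```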
Many aspects of sense evaluation have been standardized by the SENSEVAL and
SEMEVAL efforts (Palmer et al. 2006, Kilgarriff and Palmer 2000). This framework
provides a shared task with training and testing materials along with sense invento-
ries for all-words and lexical sample tasks in a variety of languages.
The normal baseline is to choose the most frequent sense for each word from the
senses in a labeled corpus (Gale et al., 1992a). For WordNet, this corresponds to the
first sense, since senses in WordNet are generally ordered from most frequent to least
frequent. WordNet sense frequencies come from the SemCor sense-tagged corpus
described above; WordNet senses that don't occur in SemCor are ordered arbitrarily
after those that do. The most frequent sense baseline can be quite accurate, and is
therefore often used as a default, to supply a word sense when a supervised algorithm
has insufficient training data.
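Under the assumption that NLTK's synset ordering reflects the SemCor-derived frequencies just described, a sketch of the baseline looks like this:

```python
# Most-frequent-sense baseline: take the first WordNet synset, which reflects
# the SemCor-derived ordering described above (assumes the nltk 'wordnet' data).
from nltk.corpus import wordnet as wn

def most_frequent_sense(word, pos=None):
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None        # None if the word is unknown

print(most_frequent_sense('bass', wn.NOUN))
```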
Figure C.9 The Simplified Lesk algorithm. The C OMPUTE OVERLAP function returns the
number of words in common between two sets, ignoring function words or other words on a
stop list. The original Lesk algorithm defines the context in a more complex way. The Cor-
pus Lesk algorithm weights each overlapping word w by its − log P(w) and includes labeled
training corpus data in the signature.
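Since the pseudocode of Fig. C.9 is not reproduced here, the following is a hedged Python sketch of Simplified Lesk: WordNet glosses and examples form the sense signature, a tiny inline stop list stands in for a real one, and the bank sentence is a context along the lines of example (C.31).

```python
# A sketch of Simplified Lesk: score each WordNet sense by the overlap between
# its signature (gloss plus examples) and the context, ignoring stop words.
# The tiny stop list and the whitespace tokenization are simplifications.
from nltk.corpus import wordnet as wn

STOP = {'a', 'an', 'the', 'of', 'in', 'on', 'to', 'and', 'or', 'is', 'are',
        'it', 'that', 'will', 'can', 'because'}

def compute_overlap(signature, context):
    return len((signature & context) - STOP)

def simplified_lesk(word, sentence):
    context = set(sentence.lower().split())
    senses = wn.synsets(word)
    if not senses:
        return None
    best_sense, best_overlap = senses[0], 0      # default: most frequent sense
    for sense in senses:
        signature = set(sense.definition().lower().split())
        for example in sense.examples():
            signature |= set(example.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

context = ('The bank can guarantee deposits will eventually cover future '
           'tuition costs because it invests in adjustable-rate mortgage securities')
print(simplified_lesk('bank', context))
```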
Sense bank1 has two non-stopwords overlapping with the context in (C.31):
deposits and mortgage, while sense bank2 has zero words, so sense bank1 is chosen.
There are many obvious extensions to Simplified Lesk. The original Lesk algo-
rithm (Lesk, 1986) is slightly more indirect. Instead of comparing a target word’s
signature with the context words, the target signature is compared with the signatures
of each of the context words. For example, consider Lesk’s example of selecting the
appropriate sense of cone in the phrase pine cone given the following definitions for
pine and cone.
pine 1 kinds of evergreen tree with needle-shaped leaves
2 waste away through sorrow or illness
cone 1 solid body which narrows to a point
2 something of this shape whether solid or hollow
3 fruit of certain evergreen trees
In this example, Lesk’s method would select cone3 as the correct sense since two
of the words in its entry, evergreen and tree, overlap with words in the entry for pine,
whereas neither of the other entries has any overlap with words in the definition of
pine. In general Simplified Lesk seems to work better than original Lesk.
The primary problem with either the original or simplified approaches, how-
ever, is that the dictionary entries for the target words are short and may not provide
enough chance of overlap with the context.3 One remedy is to expand the list of
words used in the classifier to include words related to, but not contained in, their
individual sense definitions. But the best solution, if any sense-tagged corpus data
like SemCor is available, is to add all the words in the labeled corpus sentences for a
word sense into the signature for that sense. This version of the algorithm, the
Corpus Lesk algorithm, is the best-performing of all the Lesk variants (Kilgarriff and
Rosenzweig 2000, Vasilescu et al. 2004) and is used as a baseline in the SENSEVAL
competitions. Instead of just counting up the overlapping words, the Corpus Lesk
algorithm also applies a weight to each overlapping word. The weight is the inverse
document frequency or IDF, a standard information-retrieval measure introduced
in Chapter 6. IDF measures how many different "documents" (in this case, glosses
and examples) a word occurs in and is thus a way of discounting function words.
Since function words like the, of, etc., occur in many documents, their IDF is very
low, while the IDF of content words is high. Corpus Lesk thus uses IDF instead of a
stop list.
Formally, the IDF for a word i can be defined as
idfi = log ( Ndoc / ndi )        (C.32)
3 Indeed, Lesk (1986) notes that the performance of his system seems to roughly correlate with the
length of the dictionary entries.
where Ndoc is the total number of “documents” (glosses and examples) and ndi is
the number of these documents containing word i.
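A quick sketch of Eq. C.32, treating each gloss or example as a "document", might look as follows; the toy documents are assumptions.

```python
# Quick sketch of Eq. C.32: each gloss or example is treated as a "document".
import math

def idf(word, documents):
    nd = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / nd) if nd else 0.0

documents = [{'the', 'color', 'of', 'blood', 'or', 'a', 'ruby'},
             {'a', 'financial', 'institution'},
             {'the', 'red', 'liquid', 'that', 'circulates'}]   # toy glosses
print(idf('the', documents), idf('blood', documents))   # function word vs. content word
```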
Finally, we can combine the Lesk and supervised approaches by adding new
Lesk-like bag-of-words features. For example, the glosses and example sentences
for the target sense in WordNet could be used to compute the supervised bag-of-
words features in addition to the words in the SemCor context sentence for the sense
(Yuret, 2004).
[Figure C.10 residue: the graph includes sense nodes such as food1n, liquid1n, helping1n, beverage1n, milk1n, toast4n, drink1n, drink1v, sip1v, sup1v, consume1v, drinking1n, drinker1n, consumer1n, potation1n, and consumption1n.]
Figure C.10 Part of the WordNet graph around drink1v , after Navigli and Lapata (2010).
There are various ways to use the graph for disambiguation, some using the
whole graph, some using only a subpart. For example the target word and the words
in its sentential context can all be inserted as nodes in the graph via a directed edge
to each of its senses. If we consider the sentence She drank some milk, Fig. C.11
shows a portion of the WordNet graph between the senses drink1v and milk1n .
[Figure C.11 residue: the word nodes "drink" and "milk" are each linked to their candidate senses (drink2v . . . drink5v, milk2n . . . milk4n), with sense nodes such as drinker1n, food1n, nutriment1n, and boozing1n lying on paths between drink1v and milk1n.]
Figure C.11 Part of the WordNet graph between drink1v and milk1n , for disambiguating a
sentence like She drank some milk, adapted from Navigli and Lapata (2010).
The correct sense is then the one which is the most important or central in some
way in this graph. There are many different methods for deciding centrality. The
simplest is degree, the number of edges into the node, which tends to correlate
with the most frequent sense. Another algorithm for assigning probabilities across
nodes is personalized page rank, a version of the well-known pagerank algorithm
which uses some seed nodes. By inserting a uniform probability across the word
nodes (drink and milk in the example) and computing the personalized page rank of
the graph, the result will be a pagerank value for each node in the graph, and the
sense with the maximum pagerank can then be chosen. See Agirre et al. (2014) and
Navigli and Lapata (2010) for details.
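The following sketch illustrates the seeding idea with networkx; the miniature graph is an assumed stand-in for the WordNet graph of Fig. C.11, and recent networkx versions allow the personalization dictionary to name only the seed (word) nodes.

```python
# Sketch of sense disambiguation by personalized PageRank with networkx. The
# miniature graph is an assumed stand-in for the WordNet graph of Fig. C.11.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ('drink.v.01', 'beverage.n.01'), ('beverage.n.01', 'milk.n.01'),
    ('drink.v.01', 'consume.v.02'), ('drink.v.02', 'booze.n.01'),
    ('milk.n.02', 'river.n.01'),                 # an unrelated sense of milk
])
# Word nodes linked to each of their candidate senses.
G.add_edges_from([('drink', 'drink.v.01'), ('drink', 'drink.v.02'),
                  ('milk', 'milk.n.01'), ('milk', 'milk.n.02')])

# Seed the random walk uniformly at the word nodes of the sentence.
rank = nx.pagerank(G, personalization={'drink': 0.5, 'milk': 0.5})
best = max(['milk.n.01', 'milk.n.02'], key=lambda sense: rank[sense])
print(best)   # the sense with the highest pagerank is chosen
```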
Figure C.12 The Yarowsky algorithm disambiguating “plant” at two stages; “?” indicates an unlabeled ob-
servation, A and B are observations labeled as SENSE-A or SENSE-B. The initial stage (a) shows only seed
sentences Λ0 labeled by collocates (“life” and “manufacturing”). An intermediate stage is shown in (b) where
more collocates have been discovered (“equipment”, “microscopic”, etc.) and more instances in V0 have been
moved into Λ1 , leaving a smaller unlabeled set V1 . Figure adapted from Yarowsky (1995).
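The figure's labeling-and-growing loop can be sketched roughly as follows; this toy version labels instances by collocate lookup and harvests new collocates by counting, whereas Yarowsky's actual algorithm trains a decision-list classifier and ranks collocates by log-likelihood ratio at each iteration. The seed collocates, instances, and threshold are illustrative assumptions.

```python
# Rough sketch of the bootstrapping loop in Fig. C.12. Instances are sets of
# context words for "plant"; seeds, instances, and min_count are illustrative.
from collections import Counter

SEEDS = {'life': 'A', 'manufacturing': 'B'}

def label_by_collocates(instances, collocates):
    labeled, unlabeled = [], []
    for context in instances:
        senses = {collocates[w] for w in context if w in collocates}
        if len(senses) == 1:                       # label only unambiguous matches
            labeled.append((context, senses.pop()))
        else:
            unlabeled.append(context)
    return labeled, unlabeled

def learn_new_collocates(labeled, min_count=2):
    counts = {'A': Counter(), 'B': Counter()}
    for context, sense in labeled:
        counts[sense].update(context)
    new = {}
    for sense, counter in counts.items():
        other = counts['B' if sense == 'A' else 'A']
        for word, c in counter.items():
            if c >= min_count and other[word] == 0:   # one sense per collocation
                new[word] = sense
    return new

instances = [
    {'microscopic', 'plant', 'life'},
    {'plant', 'equipment', 'manufacturing'},
    {'animal', 'and', 'plant', 'life'},
    {'plant', 'and', 'equipment', 'manufacturing'},
    {'plant', 'equipment', 'employee'},
]
labeled, unlabeled = label_by_collocates(instances, SEEDS)               # Λ0, V0
collocates = dict(SEEDS, **learn_new_collocates(labeled))                # adds "equipment"
labeled, unlabeled = label_by_collocates(instances, collocates)          # Λ1, V1
print(len(labeled), 'labeled,', len(unlabeled), 'unlabeled')             # 5 labeled, 0 unlabeled
```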
We need more good teachers – right now, there are only a half a dozen who can play
the free bass with ease.
An electric guitar and bass player stand off to one side, not really part of the scene,
The researchers said the worms spend part of their life cycle in such fish as Pacific
salmon and striped bass and Pacific rockfish or snapper.
And it all started when fishermen decided the striped bass in Lake Mead were...
Figure C.13 Samples of bass sentences extracted from the WSJ by using the simple corre-
lates play and fish.
C.9 Summary
This chapter has covered a wide range of issues concerning the meanings associated
with lexical items. The following are among the highlights:
• A word sense is the locus of word meaning; definitions and meaning relations
are defined at the level of the word sense rather than wordforms.
• Homonymy is the relation between unrelated senses that share a form, and
polysemy is the relation between related senses that share a form.
• Hyponymy and hypernymy relations hold between words that are in a class-
inclusion relationship.
• WordNet is a large database of lexical relations for English.
• Word-sense disambiguation (WSD) is the task of determining the correct
sense of a word in context. Supervised approaches make use of sentences in
which individual words (lexical sample task) or all words (all-words task)
are hand-labeled with senses from a resource like WordNet.
• Classifiers for supervised WSD are generally trained on features of the sur-
rounding words.
• An important baseline for WSD is the most frequent sense, equivalent, in
WordNet, to taking the first sense.
• The Lesk algorithm chooses the sense whose dictionary definition shares the
most words with the target word’s neighborhood.
• Graph-based algorithms view the thesaurus as a graph and choose the sense
that is most central in some way.
• Word similarity can be computed by measuring the link distance in a the-
saurus or by various measures of the information content of the two nodes.
Bibliographical and Historical Notes

Lesk (1986) was the first to use a machine-readable dictionary for word sense disambiguation.
The problem of dictionary senses being too fine-grained has been addressed by clustering
word senses into coarse senses (Dolan 1994, Chen and Chang 1998, Mihalcea and
Moldovan 2001, Agirre and de Lacalle 2003, Chklovski and Mihalcea 2003, Palmer
et al. 2004, Navigli 2006, Snow et al. 2007). Corpora with clustered word senses for
training clustering algorithms include Palmer et al. (2006) and OntoNotes (Hovy
et al., 2006).
Supervised approaches to disambiguation began with the use of decision trees by
Black (1988). The need for large amounts of annotated text in these methods led to
investigations into the use of bootstrapping methods (Hearst 1991, Yarowsky 1995).
Diab and Resnik (2002) give a semi-supervised algorithm for sense disambigua-
tion based on aligned parallel corpora in two languages. For example, the fact that
the French word catastrophe might be translated as English disaster in one instance
and tragedy in another instance can be used to disambiguate the senses of the two
English words (i.e., to choose senses of disaster and tragedy that are similar). Ab-
ney (2002) and Abney (2004) explore the mathematical foundations of the Yarowsky
algorithm and its relation to co-training. The most-frequent-sense heuristic is an ex-
tremely powerful one but requires large amounts of supervised training data.
The earliest use of clustering in the study of word senses was by Sparck Jones
(1986); Pedersen and Bruce (1997), Schütze (1997), and Schütze (1998) applied
distributional methods. Recent work on word sense induction has applied Latent
Dirichlet Allocation (LDA) (Boyd-Graber et al. 2007, Brody and Lapata 2009, Lau
et al. 2012) and large co-occurrence graphs (Di Marco and Navigli, 2013).
A collection of work concerning WordNet can be found in Fellbaum (1998).
Early work using dictionaries as lexical resources includes Amsler's (1981) use of the
Merriam Webster dictionary and Longman’s Dictionary of Contemporary English
(Boguraev and Briscoe, 1989).
Early surveys of WSD include Agirre and Edmonds (2006) and Navigli (2009).
See Pustejovsky (1995), Pustejovsky and Boguraev (1996), Martin (1986), and
Copestake and Briscoe (1995), inter alia, for computational approaches to the representation
of polysemy. Pustejovsky's theory of the generative lexicon, and in
particular his theory of the qualia structure of words, is another way of accounting
for the dynamic systematic polysemy of words in context.
Another important recent direction is the addition of sentiment and connotation
to knowledge bases (Wiebe et al. 2005, Qiu et al. 2009, Velikovich et al. 2010)
including SentiWordNet (Baccianella et al., 2010) and ConnotationWordNet (Kang
et al., 2014).
Exercises
C.1 Collect a small corpus of example sentences of varying lengths from any
newspaper or magazine. Using WordNet or any standard dictionary, deter-
mine how many senses there are for each of the open-class words in each sen-
tence. How many distinct combinations of senses are there for each sentence?
How does this number seem to vary with sentence length?
C.2 Using WordNet or a standard reference dictionary, tag each open-class word
in your corpus with its correct tag. Was choosing the correct sense always a
straightforward task? Report on any difficulties you encountered.
C.3 Using your favorite dictionary, simulate the original Lesk word overlap dis-
ambiguation algorithm described on page 16 on the phrase Time flies like an
arrow. Assume that the words are to be disambiguated one at a time, from
left to right, and that the results from earlier decisions are used later in the
process.
C.4 Build an implementation of your solution to the previous exercise. Using
WordNet, implement the original Lesk word overlap disambiguation algo-
rithm described on page 16 on the phrase Time flies like an arrow.
Abney, S. P. (2002). Bootstrapping. In ACL-02, pp. 360–367.

Abney, S. P. (2004). Understanding the Yarowsky algorithm. Computational Linguistics, 30(3), 365–395.

Agirre, E. and de Lacalle, O. L. (2003). Clustering WordNet word senses. In RANLP 2003.

Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Lopez-Gazpio, I., Maritxalar, M., Mihalcea, R., Rigau, G., Uria, L., and Wiebe, J. (2015). SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In SemEval-15, pp. 252–263.

Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. (2012). SemEval-2012 Task 6: A pilot on semantic textual similarity. In SemEval-12, pp. 385–393.

Agirre, E. and Edmonds, P. (Eds.). (2006). Word Sense Disambiguation: Algorithms and Applications. Kluwer.

Agirre, E., López de Lacalle, O., and Soroa, A. (2014). Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1), 57–84.

Amsler, R. A. (1981). A taxonomy of English nouns and verbs. In ACL-81, Stanford, CA, pp. 133–138.

Atkins, S. (1993). Tools for computer-aided corpus lexicography: The Hector project. Acta Linguistica Hungarica, 41, 5–72.

Baccianella, S., Esuli, A., and Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In LREC-10, pp. 2200–2204.

Banerjee, S. and Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness. In IJCAI 2003, pp. 805–810.

Black, E. (1988). An experiment in computational discrimination of English word senses. IBM Journal of Research and Development, 32(2), 185–194.

Boguraev, B. and Briscoe, T. (Eds.). (1989). Computational Lexicography for Natural Language Processing. Longman.

Boyd-Graber, J., Blei, D. M., and Zhu, X. (2007). A topic model for word sense disambiguation. In EMNLP/CoNLL 2007.

Brody, S. and Lapata, M. (2009). Bayesian word sense induction. In EACL-09, pp. 103–111.

Bruce, R. F. and Wiebe, J. (1994). Word-sense disambiguation using decomposable models. In ACL-94, Las Cruces, NM, pp. 139–145.

Chen, J. N. and Chang, J. S. (1998). Topical clustering of MRD senses based on information retrieval techniques. Computational Linguistics, 24(1), 61–96.

Chklovski, T. and Mihalcea, R. (2003). Exploiting agreement and disagreement of human annotators for word sense disambiguation. In RANLP 2003.

Copestake, A. and Briscoe, T. (1995). Semi-productive polysemy and sense extension. Journal of Semantics, 12(1), 15–68.

Cottrell, G. W. (1985). A Connectionist Approach to Word Sense Disambiguation. Ph.D. thesis, University of Rochester, Rochester, NY. Revised version published by Pitman, 1989.

Di Marco, A. and Navigli, R. (2013). Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics, 39(3), 709–754.

Diab, M. and Resnik, P. (2002). An unsupervised method for word sense tagging using parallel corpora. In ACL-02, pp. 255–262.

Dolan, W. B. (1994). Word sense ambiguation: Clustering related senses. In COLING-94, Kyoto, Japan, pp. 712–716.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons.

Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. MIT Press.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2002). Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20(1), 116–131.

Gale, W. A., Church, K. W., and Yarowsky, D. (1992a). Estimating upper and lower bounds on the performance of word-sense disambiguation programs. In ACL-92, Newark, DE, pp. 249–256.

Gale, W. A., Church, K. W., and Yarowsky, D. (1992b). One sense per discourse. In Proceedings DARPA Speech and Natural Language Workshop, pp. 233–237.

Hearst, M. A. (1991). Noun homograph disambiguation. In Proceedings of the 7th Conference of the University of Waterloo Centre for the New OED and Text Research, pp. 1–19.

Hill, F., Reichart, R., and Korhonen, A. (2015). SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4), 665–695.

Hirst, G. (1987). Semantic Interpretation and the Resolution of Ambiguity. Cambridge University Press.

Hirst, G. (1988). Resolving lexical ambiguity computationally with spreading activation and polaroid words. In Small, S. L., Cottrell, G. W., and Tanenhaus, M. K. (Eds.), Lexical Ambiguity Resolution, pp. 73–108. Morgan Kaufmann.

Hirst, G. and Charniak, E. (1982). Word sense and case slot disambiguation. In AAAI-82, pp. 95–98.

Hovy, E. H., Marcus, M. P., Palmer, M., Ramshaw, L. A., and Weischedel, R. (2006). OntoNotes: The 90% solution. In HLT-NAACL-06.

Huang, E. H., Socher, R., Manning, C. D., and Ng, A. Y. (2012). Improving word representations via global context and multiple word prototypes. In ACL 2012, pp. 873–882.

Iacobacci, I., Pilehvar, M. T., and Navigli, R. (2016). Embeddings for word sense disambiguation: An evaluation study. In ACL 2016, pp. 897–907.

Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In ROCLING X, Taiwan.

Jurgens, D. and Klapaftis, I. P. (2013). SemEval-2013 Task 13: Word sense induction for graded and non-graded senses. In *SEM, pp. 290–299.

Kang, J. S., Feng, S., Akoglu, L., and Choi, Y. (2014). ConnotationWordNet: Learning connotation over the word+sense network. In ACL 2014.

Kawamoto, A. H. (1988). Distributed representations of ambiguous words and their resolution in connectionist networks. In Small, S. L., Cottrell, G. W., and Tanenhaus, M. (Eds.), Lexical Ambiguity Resolution, pp. 195–228. Morgan Kaufmann.

Kelly, E. F. and Stone, P. J. (1975). Computer Recognition of English Word Senses. North-Holland.

Kilgarriff, A. (2001). English lexical sample task description. In Proceedings of Senseval-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, pp. 17–20.

Kilgarriff, A. and Palmer, M. (Eds.). (2000). Computing and the Humanities: Special Issue on SENSEVAL, Vol. 34. Kluwer.

Kilgarriff, A. and Rosenzweig, J. (2000). Framework and results for English SENSEVAL. Computers and the Humanities, 34, 15–48.

Krovetz, R. (1998). More than one sense per discourse. In Proceedings of the ACL-SIGLEX SENSEVAL Workshop.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.

Landes, S., Leacock, C., and Tengi, R. I. (1998). Building semantic concordances. In Fellbaum, C. (Ed.), WordNet: An Electronic Lexical Database, pp. 199–216. MIT Press.

Lau, J. H., Cook, P., McCarthy, D., Newman, D., and Baldwin, T. (2012). Word sense induction for novel sense detection. In EACL-12, pp. 591–601.

Leacock, C. and Chodorow, M. S. (1998). Combining local context and WordNet similarity for word sense identification. In Fellbaum, C. (Ed.), WordNet: An Electronic Lexical Database, pp. 265–283. MIT Press.

Leacock, C., Towell, G., and Voorhees, E. M. (1993). Corpus-based statistical sense resolution. In HLT-93, pp. 260–265.

Lesk, M. E. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th International Conference on Systems Documentation, Toronto, CA, pp. 24–26.

Lin, D. (1998). An information-theoretic definition of similarity. In ICML 1998, San Francisco, pp. 296–304.

Madhu, S. and Lytel, D. (1965). A figure of merit technique for the resolution of non-grammatical ambiguity. Mechanical Translation, 8(2), 9–13.

Manandhar, S., Klapaftis, I. P., Dligach, D., and Pradhan, S. (2010). SemEval-2010 Task 14: Word sense induction & disambiguation. In SemEval-2010, pp. 63–68.

Martin, J. H. (1986). The acquisition of polysemy. In ICML 1986, Irvine, CA, pp. 198–204.

Masterman, M. (1957). The thesaurus in syntax and semantics. Mechanical Translation, 4(1), 1–2.

Mihalcea, R. (2007). Using Wikipedia for automatic word sense disambiguation. In NAACL-HLT 07, pp. 196–203.

Mihalcea, R. and Moldovan, D. (2001). Automatic generation of a coarse grained WordNet. In NAACL Workshop on WordNet and Other Lexical Resources.

Miller, G. A. and Charles, W. G. (1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28.

Miller, G. A., Leacock, C., Tengi, R. I., and Bunker, R. T. (1993). A semantic concordance. In Proceedings ARPA Workshop on Human Language Technology, pp. 303–308.

Morris, W. (Ed.). (1985). American Heritage Dictionary (2nd College Edition Ed.). Houghton Mifflin.

Navigli, R. (2006). Meaningful clustering of senses helps boost word sense disambiguation performance. In COLING/ACL 2006, pp. 105–112.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys, 41(2).

Navigli, R. and Lapata, M. (2010). An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4), 678–692.

Navigli, R. and Vannella, D. (2013). SemEval-2013 Task 11: Word sense induction & disambiguation within an end-user application. In *SEM, pp. 193–201.

Palmer, M., Babko-Malaya, O., and Dang, H. T. (2004). Different sense granularities for different applications. In HLT-NAACL Workshop on Scalable Natural Language Understanding, Boston, MA, pp. 49–56.

Palmer, M., Dang, H. T., and Fellbaum, C. (2006). Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Natural Language Engineering, 13(2), 137–163.

Palmer, M., Fellbaum, C., Cotton, S., Delfs, L., and Dang, H. T. (2001). English tasks: All-words and verb lexical sample. In Proceedings of Senseval-2: 2nd International Workshop on Evaluating Word Sense Disambiguation Systems, Toulouse, France, pp. 21–24.

Palmer, M., Ng, H. T., and Dang, H. T. (2006). Evaluation of WSD systems. In Agirre, E. and Edmonds, P. (Eds.), Word Sense Disambiguation: Algorithms and Applications. Kluwer.

Pedersen, T. and Bruce, R. (1997). Distinguishing word senses in untagged text. In EMNLP 1997, Providence, RI.

Ponzetto, S. P. and Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling supervised systems. In ACL 2010, pp. 1522–1531.

Pustejovsky, J. (1995). The Generative Lexicon. MIT Press.

Pustejovsky, J. and Boguraev, B. (Eds.). (1996). Lexical Semantics: The Problem of Polysemy. Oxford University Press.

Qiu, G., Liu, B., Bu, J., and Chen, C. (2009). Expanding domain sentiment lexicon through double propagation. In IJCAI-09, pp. 1199–1204.

Quillian, M. R. (1968). Semantic memory. In Minsky, M. (Ed.), Semantic Information Processing, pp. 227–270. MIT Press.

Quillian, M. R. (1969). The teachable language comprehender: A simulation program and theory of language. Communications of the ACM, 12(8), 459–476.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. In International Joint Conference for Artificial Intelligence (IJCAI-95), pp. 448–453.

Riesbeck, C. K. (1975). Conceptual analysis. In Schank, R. C. (Ed.), Conceptual Information Processing, pp. 83–156. American Elsevier, New York.

Rubenstein, H. and Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.