Week 8
Pawan Goyal
Week 8, Lecture 1
Definition
Lexical semantics is concerned with the systematic meaning related
connections among lexical items, and the internal meaning-related structure of
individual lexical items.
To identify the semantics of lexical items, we need to focus on the notion of
lexeme, an individual entry in the lexicon.
What is a lexeme?
A lexeme should be thought of as a pairing of a particular orthographic and phonological form with some sort of symbolic meaning representation.
Orthographic form and phonological form refer to the form part of a lexeme.
Sense refers to a lexeme’s meaning counterpart.
red n. the color of blood or a ruby
blood n. the red liquid that circulates in the heart, arteries and veins of
animals
The entries are descriptions of lexemes in terms of other lexemes.
Dictionary definitions of right and left similarly make it clear that they are similar kinds of lexemes that stand in some kind of alternation, or opposition, to one another.
We can glean that red is a color, that it can be applied to both blood and rubies, and that blood is a liquid.
Relations between senses
Homonymy
Polysemy
Synonymy
Antonymy
Hypernymy
Hyponymy
Meronymy
Homonymy
Examples
Bat (wooden stick-like thing) vs Bat (flying mammal thing)
Bank (financial institution) vs Bank (riverside)
Homonymy causes problems for NLP applications
Text-to-Speech: same orthographic form but different phonological form
Information Retrieval: different meaning but same orthographic form
Speech Recognition: to, two, too
Perfect homonyms are also problematic
Polysemy
Consider bank:
Sense 1: “The building belonging to a financial institution”
Sense 2: “A financial institution”
Are those the same sense?
Another example
Heavy snow caused the roof of the school to collapse.
The school hired more teachers this year than ever before.
More examples:
Author (Jane Austen wrote Emma) ↔ Works of Author (I really love Jane Austen)
Zeugma test
Combine two separate uses of a lexeme into a single example using conjunction:
Which of these flights serve breakfast?
Does Midwest Express serve Philadelphia?
*Does Midwest Express serve breakfast and San Jose?
Since it sounds weird, we say that these are two different senses of serve.
Synonymy
couch / sofa
big / large
automobile / car
vomit / throw up
water / H2O
Two lexemes are synonyms if they can be successfully substituted for each
other in all situations.
Would I be flying on a large or small plane?
Miss Nelson, for instance, became a kind of big sister to Benjamin.
*Miss Nelson, for instance, became a kind of large sister to Benjamin.
Why?
big has a sense that means being older, or grown up
large lacks this sense
Shades of meaning
What is the cheapest first class fare?
*What is the cheapest first class price?
Collocational constraints
We frustrate ’em and frustrate ’em, and pretty soon they make a big mistake.
*We frustrate ’em and frustrate ’em, and pretty soon they make a large mistake.
Antonymy
Senses that are opposites with respect to one feature of their meaning
Otherwise, they are similar!
I dark / light
I short / long
I hot / cold
I up / down
I in / out
Hyponymy
One sense is a hyponym of another if the first sense is more specific, denoting
a subclass of the other
car is a hyponym of vehicle
dog is a hyponym of animal
mango is a hyponym of fruit
Hypernymy
Conversely
vehicle is a hypernym/superordinate of car
animal is a hypernym of dog
fruit is a hypernym of mango
Entailment
Sense A is a hyponym of sense B if being an A entails being a B.
Ex: dog, animal
Transitivity
A hypo B and B hypo C entails A hypo C
Definition
Meronymy: an asymmetric, transitive relation between senses.
X is a meronym of Y if it denotes a part of Y .
The inverse relation is holonymy.
meronym    holonym
porch      house
wheel      car
leg        chair
nose       face
Pawan Goyal
Week 8, Lecture 2
WordNet
https://wordnet.princeton.edu/wordnet/
A hierarchically organized lexical database
A machine-readable thesaurus, and aspects of a dictionary
Versions for other languages are under development
part of speech    no. synsets
noun              82,115
verb              13,767
adjective         18,156
adverb             3,621
Example: chump as a noun to mean ‘a person who is gullible and easy to
take advantage of’
Chump belongs to a synset, a set of near-synonymous senses; each of these senses shares this same gloss.
For WordNet, the meaning of this sense of chump is this list.
Word similarity
I Word similarity or
I Word distance
Two words are more similar if they share more features of meaning
Actually these are really relations between senses:
I Instead of saying “bank is like fund”
I We say
F Bank1 is similar to fund3
F Bank2 is similar to slope5
Distributional algorithms: compare words based on their distributional context in corpora
Thesaurus-based algorithms: based on whether words are “nearby” in WordNet
A thesaurus like WordNet also provides:
I Glosses and example sentences
In practice, “thesaurus-based” methods usually use:
I the is-a/subsumption/hypernymy hierarchy
I and sometimes the glosses too
Word similarity vs. word relatedness
I Similar words are near-synonyms
I Related words could be related any way
F car, gasoline: related, but not similar
F car, bicycle: similar
Basic Idea
Two words are similar if they are nearby in the hypernym graph
pathlen(c1, c2) = number of edges in the shortest path (in the hypernym graph) between senses c1 and c2
simpath(c1, c2) = 1 / (1 + pathlen(c1, c2))
sim(w1, w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1, c2)
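A minimal sketch of these two formulas over a toy hypernym graph (the graph, node names, and helper functions below are illustrative, not from the lecture):

```python
from collections import deque

# Toy hypernym graph: sense -> its hypernyms (illustrative only).
hypernyms = {
    "car": ["vehicle"], "truck": ["vehicle"], "vehicle": ["artifact"],
    "artifact": ["entity"], "food": ["entity"],
    "fruit": ["food"], "mango": ["fruit"],
}

def pathlen(c1, c2):
    """Number of edges on the shortest path between two senses,
    following hypernym links in either direction (plain BFS)."""
    adj = {}
    for child, parents in hypernyms.items():
        for p in parents:
            adj.setdefault(child, set()).add(p)
            adj.setdefault(p, set()).add(child)
    dist, queue = {c1: 0}, deque([c1])
    while queue:
        node = queue.popleft()
        if node == c2:
            return dist[node]
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return float("inf")

def sim_path(c1, c2):
    # simpath(c1, c2) = 1 / (1 + pathlen(c1, c2))
    return 1.0 / (1.0 + pathlen(c1, c2))

print(sim_path("car", "truck"))   # 2 edges via "vehicle" -> 1/3
print(sim_path("car", "mango"))   # longer path -> smaller similarity
```

For word-level similarity one would then take the maximum over all sense pairs, as in the last formula above.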
L-C similarity
d: maximum depth of the hierarchy
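The formula itself is left to the slide figure; the standard Leacock–Chodorow measure, assuming pathlen is counted in edges as above (implementations sometimes count nodes instead), is

\[ \mathrm{sim}_{LC}(c_1, c_2) = -\log\frac{\mathrm{pathlen}(c_1, c_2)}{2d} \]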
Concept probabilities
For each concept (synset) c, let P(c) be the probability that a randomly selected word in a corpus is an instance (hyponym) of c
P(ROOT) = 1
The lower a node in the hierarchy, the lower its probability
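The estimator is not spelled out on the slide; a common choice, following Resnik, counts, for each concept, the corpus frequency of every word it subsumes:

\[ P(c) = \frac{\sum_{w \in \mathrm{words}(c)} \mathrm{count}(w)}{N} \]

where words(c) is the set of words subsumed by c and N is the total number of word tokens covered by the hierarchy.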
Information content
Information content: IC(c) = −logP(c)
Lowest common subsumer LCS(c1, c2): the lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
We are now ready to see how to use information content (IC) as a
similarity metric.
Resnik Similarity
Intuition: how similar two words are depends on how much they have in common
It measures the commonality by the information content of the lowest common subsumer
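The formula itself (not shown on this slide) is then just the information content of that lowest common subsumer:

\[ \mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = -\log P(\mathrm{LCS}(c_1, c_2)) = \mathrm{IC}(\mathrm{LCS}(c_1, c_2)) \]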
Lin: The more information content they don’t share, the less similar they
are
Not the absolute quantity of shared information but the proportion of
shared information
simLin(c1, c2) = 2 log P(LCS(c1, c2)) / (log P(c1) + log P(c2))
The information content common to c1 and c2, normalized by their average information content.
JC similarity
We can use IC to assign lengths to graph edges:
distJC(c, hypernym(c)) = IC(c) − IC(hypernym(c))
distJC(c1, c2) = distJC(c1, LCS(c1, c2)) + distJC(c2, LCS(c1, c2))
             = IC(c1) − IC(LCS(c1, c2)) + IC(c2) − IC(LCS(c1, c2))
             = IC(c1) + IC(c2) − 2 × IC(LCS(c1, c2))
simJC(c1, c2) = 1 / (IC(c1) + IC(c2) − 2 × IC(LCS(c1, c2)))
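All of these thesaurus-based measures are available through NLTK's WordNet interface; a small sketch, assuming NLTK with the wordnet and wordnet_ic data packages installed:

```python
from nltk.corpus import wordnet as wn, wordnet_ic

# Information content estimated from the Brown corpus (shipped with wordnet_ic).
brown_ic = wordnet_ic.ic("ic-brown.dat")

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")

print(dog.path_similarity(cat))            # 1 / (1 + pathlen)
print(dog.lch_similarity(cat))             # Leacock-Chodorow
print(dog.res_similarity(cat, brown_ic))   # Resnik: IC of the LCS
print(dog.lin_similarity(cat, brown_ic))   # Lin
print(dog.jcn_similarity(cat, brown_ic))   # Jiang-Conrath (1 / distJC)
```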
Gloss overlap
I Drawing paper: paper that is specially prepared for use in drafting
I Decal: the art of transferring designs from specially prepared paper to a
wood or glass or metal surface
For each n-word phrase that occurs in both glosses, add a score of n²
paper (1²) and specially prepared (2²) → 1 + 4 = 5
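A minimal sketch of that scoring: greedily find the longest phrase common to both glosses, add n² for its length, remove it, and repeat (the function names are illustrative):

```python
def longest_common_phrase(a, b):
    """Longest contiguous run of tokens shared by token lists a and b.
    Returns (length, start_in_a, start_in_b)."""
    best = (0, 0, 0)
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best

def overlap_score(gloss1, gloss2):
    """Add n^2 for every maximal n-word phrase shared by the two glosses."""
    a, b = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        del a[i:i + n]   # remove the matched phrase so it is not counted twice
        del b[j:j + n]

drawing_paper = "paper that is specially prepared for use in drafting"
decal = ("the art of transferring designs from specially prepared paper "
         "to a wood or glass or metal surface")
print(overlap_score(drawing_paper, decal))  # "specially prepared" (4) + "paper" (1) = 5
```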
I saw a man who is 98 years old and can still walk and tell jokes
Pawan Goyal
Week 8, Lecture 3
Sense ambiguity
Many words have several meanings or senses
The meaning of bass depends on the context
Are we talking about music, or fish?
I An electric guitar and bass player stand off to one side, not really part of
the scene, just as a sort of nod to gringo expectations perhaps.
Approaches to WSD
Knowledge Based Approaches
I Overlap Based Approaches
Machine Learning Based Approaches
I Supervised Approaches
I Semi-supervised Algorithms
I Unsupervised Algorithms
Hybrid Approaches
Overlap-based approach
Find the overlap between the features of different senses of an ambiguous word (sense bag) and the features of the words in its context (context bag).
The features could be sense definitions, example sentences, hypernyms, etc.
Example: On burning coal we get ash. (Which sense of ash?)
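A minimal simplified-Lesk sketch for that example; the two senses and their glosses below are hand-written for illustration, not actual dictionary entries:

```python
STOPWORDS = {"the", "a", "an", "of", "on", "we", "get", "is", "as", "or", "and", "with", "from"}

# Toy sense inventory for "ash" (glosses written only for illustration).
senses = {
    "ash (residue)": "the powdery residue left after burning coal or wood",
    "ash (tree)":    "a tree of the olive family with compound leaves and winged fruits",
}

def simplified_lesk(context_sentence, senses):
    """Pick the sense whose gloss (sense bag) overlaps most with the context bag."""
    context_bag = {w for w in context_sentence.lower().split() if w not in STOPWORDS}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        sense_bag = {w for w in gloss.lower().split() if w not in STOPWORDS}
        overlap = len(context_bag & sense_bag)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap

print(simplified_lesk("On burning coal we get ash", senses))
# -> ('ash (residue)', 2): "burning" and "coal" overlap with that gloss
```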
A context word will add 1 to the score of the sense if the thesaurus
category of the word matches that of the sense.
I E.g. The money in this bank fetches an interest of 8% per annum
Target word: bank
Clue words from the context: money, interest, annum, fetch
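A small sketch of that counting with a hand-built category table; the categories and the word-to-category assignments are illustrative, not taken from an actual thesaurus:

```python
# Thesaurus categories of the two senses of "bank" (illustrative).
sense_category = {"bank#finance": "FINANCE", "bank#river": "GEOGRAPHY"}

# Category membership of possible context words (illustrative).
word_categories = {
    "money": {"FINANCE"}, "interest": {"FINANCE"}, "annum": {"FINANCE"},
    "fetch": set(), "water": {"GEOGRAPHY"}, "river": {"GEOGRAPHY"},
}

def walker_score(context_words, sense_category, word_categories):
    """Each context word adds 1 to every sense whose thesaurus category it shares."""
    scores = {sense: 0 for sense in sense_category}
    for word in context_words:
        for sense, category in sense_category.items():
            if category in word_categories.get(word, set()):
                scores[sense] += 1
    return scores

context = ["money", "interest", "annum", "fetch"]
print(walker_score(context, sense_category, word_categories))
# {'bank#finance': 3, 'bank#river': 0} -> the financial sense wins
```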
Naïve Bayes for WSD
Choose the most probable sense given the feature vector f: ŝ = argmax_{s∈S} P(s|f)
Using Bayes’ law, this can be expressed as:
ŝ = argmax_{s∈S} P(s) P(f|s)
With the naive independence assumption, P(f|s) is the product of the individual feature probabilities P(fj|s).
Features:
I Words in the context and their POS’s
I Co-occurrence vector
Set parameters of Naïve Bayes using maximum likelihood estimation
(MLE) from training data
P(si) = count(si, wj) / count(wj)
P(fj | si) = count(fj, si) / count(si)
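A minimal sketch of these estimates and the arg-max decision over a tiny hand-labelled training set; the data, the tokenization, and the add-one smoothing (not on the slide) are all illustrative choices:

```python
from collections import Counter, defaultdict

# Tiny sense-tagged corpus for the word "bank" (illustrative).
training = [
    ("FINANCE", "the money in this bank fetches a good interest"),
    ("FINANCE", "she deposited the cheque at the bank"),
    ("RIVER",   "they walked along the bank of the river"),
]

sense_counts = Counter()
feature_counts = defaultdict(Counter)
vocab = set()
for sense, sentence in training:
    sense_counts[sense] += 1
    for w in sentence.lower().split():
        feature_counts[sense][w] += 1
        vocab.add(w)

def p_sense(s):
    # P(s_i): fraction of the word's training instances labelled with sense s_i
    return sense_counts[s] / sum(sense_counts.values())

def p_feature(f, s):
    # Relative frequency of feature f among sense s's tokens, with add-one smoothing
    return (feature_counts[s][f] + 1) / (sum(feature_counts[s].values()) + len(vocab))

def disambiguate(sentence):
    # s_hat = argmax_s P(s) * prod_j P(f_j | s)
    scores = {}
    for s in sense_counts:
        score = p_sense(s)
        for f in sentence.lower().split():
            score *= p_feature(f, s)
        scores[s] = score
    return max(scores, key=scores.get)

print(disambiguate("interest rates at this bank"))  # -> FINANCE
```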
Decision lists
Collect a large set of collocations for the ambiguous word
Calculate word-sense probability distributions for all such collocations
Calculate the log-likelihood ratio:
log( P(Sense-A | Collocationi) / P(Sense-B | Collocationi) )
Higher log-likelihood ⇒ more predictive evidence
Collocations are ordered in a decision list, with the most predictive collocations ranked highest
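A compact sketch of building such a list from per-collocation sense counts; the collocations, the counts, and the smoothing constant are made up for illustration:

```python
import math

# (count with Sense-A, count with Sense-B) for each collocation of "bass" (illustrative numbers).
counts = {
    "fish within window":   (0, 98),
    "play within window":   (45, 2),
    "guitar within window": (30, 1),
    "river within window":  (0, 40),
}

def log_likelihood(a, b, alpha=0.1):
    """|log( P(Sense-A | collocation) / P(Sense-B | collocation) )|, smoothed to avoid log(0)."""
    return abs(math.log((a + alpha) / (b + alpha)))

# Order collocations so that the most predictive evidence is ranked highest.
decision_list = sorted(
    ((log_likelihood(a, b), coll, "A" if a > b else "B") for coll, (a, b) in counts.items()),
    reverse=True,
)
for score, coll, sense in decision_list:
    print(f"{score:5.2f}  {coll:22s} -> Sense-{sense}")
```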
Classification of a test sentence is based on the highest-ranking collocation found in the test sentence.
Pawan Goyal
Week 8, Lecture 4
“Bootstrapping” or co-training
I Start with (small) seed, learn decision list
I Use decision list to label rest of corpus
I Retain ‘confident’ labels, treat as annotated data to learn new decision list
I Repeat . . .
Heuristics (derived from observation):
I One sense per discourse
I One sense per collocation
A word tends to preserve its meaning across all its occurrences in a given
discourse
Example
Disambiguating plant (industrial sense) vs. plant (living thing sense)
Think of seed features for each sense
I Industrial sense: co-occurring with ‘manufacturing’
I Living thing sense: co-occurring with ‘life’
Use ‘one sense per collocation’ to build initial decision list classifier
Treat results (having high probability) as annotated data, train new decision list classifier, iterate
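The loop itself, as a schematic sketch: the trainer is passed in as a callable, since this fragment only shows the bootstrapping control flow, not any particular classifier:

```python
def bootstrap(train, seed_labels, contexts, max_iter=10, threshold=0.9):
    """Yarowsky-style bootstrapping.

    train(labelled) must return a classifier: a callable mapping a context to
    (sense, confidence). Both are supplied by the caller; this sketch only
    shows how the labelled set is grown with confident predictions.
    """
    labelled = dict(seed_labels)                  # {example_id: sense}
    for _ in range(max_iter):
        classify = train(labelled)
        added = 0
        for ex_id, context in contexts.items():
            if ex_id in labelled:
                continue
            sense, confidence = classify(context)
            if confidence >= threshold:           # retain only 'confident' labels
                labelled[ex_id] = sense
                added += 1
        if added == 0:                            # no new confident labels: stop
            break
    return train(labelled)                        # final classifier for WSD
```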
Termination
Stop when
I Error on training data is less than a threshold
I No more training data is covered
Use final decision list for WSD
Advantages
Accuracy is about as good as a supervised algorithm
Bootstrapping: far less manual effort
Senses can also be induced from a co-occurrence graph built over the contexts for a word.
In each high density component one of the nodes (hub) has a higher
degree than the others.
Step 1: Construct co-occurrence graph, G.
Step 2: Arrange nodes in G in decreasing order of degree.
Step 3: Select the node from G which has the highest degree. This node will be the hub of the first high density component.
Step 4: Delete this hub and all its neighbors from G.
Step 5: Repeat Steps 3 and 4 to detect the hubs of other high density components.
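A small sketch of Steps 1-5 on a toy co-occurrence graph for an ambiguous target word; the graph and the minimum-degree cut-off are illustrative:

```python
# Toy co-occurrence graph for a target word, as an undirected adjacency map (illustrative).
graph = {
    "guitar": {"play", "music", "string"},
    "play":   {"guitar", "music"},
    "music":  {"guitar", "play", "string"},
    "string": {"guitar", "music"},
    "fish":   {"river", "catch"},
    "river":  {"fish", "catch"},
    "catch":  {"fish", "river"},
}

def find_hubs(graph, min_degree=2):
    """Repeatedly pick the highest-degree node as a hub, then delete it and its neighbours."""
    g = {node: set(neigh) for node, neigh in graph.items()}
    hubs = []
    while g:
        # Steps 2-3: the node of highest remaining degree becomes the next hub.
        hub = max(g, key=lambda n: len(g[n]))
        if len(g[hub]) < min_degree:        # remaining nodes are too weakly connected
            break
        hubs.append(hub)
        # Step 4: remove the hub and all its neighbours from the graph.
        removed = {hub} | g[hub]
        g = {n: neigh - removed for n, neigh in g.items() if n not in removed}
    return hubs

print(find_hubs(graph))   # ['guitar', 'fish'] -> one hub per high density component
```

Each hub returned this way stands for one induced sense (one high density component).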
The distance between two nodes is measured as the smallest sum of
weights of the edges on the paths linking them.
Computing distance between two nodes wi and wj
A score vector s is associated with each wj ∈ W(j, i), such that sk represents the contribution of the kth hub as:
sk = 1 / (1 + d(hk, wj)) if hk is an ancestor of wj
sk = 0 otherwise.
All score vectors associated with all wj ∈ W(j , i) are summed up
The hub which receives the maximum score is chosen as the most
appropriate sense
Pawan Goyal
Week 8, Lecture 5: Novel Word Sense Detection
Tracking Sense Changes
Classical sense¹
Novel sense
¹ http://www.merriam-webster.com/
Comparing sense clusters
Split, join, birth and death
A real example of birth