
Natural Language Processing 1


Lecture 5: Lexical and distributional semantics

Katia Shutova

ILLC
University of Amsterdam

12 November 2018

Semantics

Compositional semantics:
- studies how the meanings of phrases are constructed out of the meanings of individual words
- principle of compositionality: the meaning of each whole phrase is derivable from the meanings of its parts
- sentence structure conveys some of this meaning: it is obtained from the syntactic representation

Lexical semantics:
- studies how the meanings of individual words can be represented and induced
Words and concepts

What is lexical meaning?

- recent results in psychology and cognitive neuroscience give us some clues
- but we don’t have the whole picture yet
- different representations proposed, e.g.
  - formal semantic representations based on logic,
  - or taxonomies relating words to each other,
  - or distributional representations in statistical NLP
- but none of the representations gives us a complete account of lexical meaning

How to approach lexical meaning?


- Formal semantics: set-theoretic approach
  e.g., cat′: the set of all cats; bird′: the set of all birds.
- meaning postulates, e.g.

  ∀x[bachelor′(x) → man′(x) ∧ unmarried′(x)]

- Limitations, e.g. is the current Pope a bachelor?
- Defining concepts through enumeration of all of their features is, in practice, highly problematic
- How would you define e.g. chair, tomato, thought, democracy? – impossible for most concepts
- Prototype theory offers an alternative to set-theoretic approaches

Prototype theory

- introduced the notion of graded semantic categories
- no clear boundaries
- no requirement that a property or set of properties be shared by all members
- certain members of a category are more central or prototypical (i.e. instantiate the prototype)
  furniture: chair is more prototypical than stool

Eleanor Rosch (1975). Cognitive Representation of Semantic Categories. Journal of Experimental Psychology.

Prototype theory (continued)

- Categories form around prototypes; new members added on the basis of resemblance to the prototype
- Features/attributes generally graded
- Category membership a matter of degree
- Categories do not have clear boundaries

Semantic relations

Hyponymy: IS-A

dog is a hyponym of animal
animal is a hypernym of dog

- hyponymy relationships form a taxonomy
- works best for concrete nouns
- multiple inheritance: e.g., is coin a hyponym of both metal and money?

Other semantic relations

Meronymy: PART-OF e.g., arm is a meronym of body, steering wheel is a meronym of car (piece vs part)
Synonymy e.g., aubergine/eggplant
Antonymy e.g., big/little
Also:
Near-synonymy/similarity e.g., exciting/thrilling, slim/slender/thin/skinny

WordNet

- large-scale, open-source resource for English
- hand-constructed
- wordnets being built for other languages
- organized into synsets: synonym sets (near-synonyms)
- synsets connected by semantic relations

S: (v) interpret, construe, see (make sense of; assign a meaning to) "How do you interpret his behavior?"
S: (v) understand, read, interpret, translate (make sense of a language) "She understands French"; "Can you read Greek?"
Polysemy

Polysemy and word senses

The children ran to the store


If you see this man, run!
Service runs all the way to Cranbury
She is running a relief operation in Sudan
the story or argument runs as follows
Does this old car still run well?
Interest rates run from 5 to 10 percent
Who’s running for treasurer this year?
They ran the tapes over and over again
These dresses run small

Polysemy
- homonymy: unrelated word senses. bank (raised land) vs bank (financial institution)
- bank (financial institution) vs bank (in a casino): related but distinct senses
- regular polysemy and sense extension:
  - zero-derivation, e.g. tango (N) vs tango (V), or rabbit, turkey, halibut (meat / animal)
  - metaphorical senses, e.g. swallow [food], swallow [information], swallow [anger]
  - metonymy, e.g. he played Bach; he drank his glass
- vagueness: nurse, lecturer, driver
- cultural stereotypes: nurse, lecturer, driver
No clear-cut distinctions.

Word sense disambiguation


- Needed for many applications
- relies on context, e.g. collocations: striped bass (the fish) vs bass guitar
Methods:
- supervised learning (a toy sketch follows below):
  - assume a predefined set of word senses, e.g. WordNet
  - need a large sense-tagged training corpus (difficult to construct)
- semi-supervised learning (Yarowsky, 1995):
  - bootstrap from a few examples
- unsupervised sense induction:
  - e.g. cluster the contexts in which a word occurs
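A toy sketch of WSD as supervised classification over bag-of-context-words features, using scikit-learn (assumed installed). The tiny sense-tagged "corpus" below is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented sense-tagged contexts for the ambiguous word "bass".
train_contexts = [
    "caught a striped bass in the lake",      # sense: fish
    "grilled bass served with lemon",         # sense: fish
    "played the bass guitar on stage",        # sense: music
    "turned up the bass on the amplifier",    # sense: music
]
train_senses = ["fish", "fish", "music", "music"]

# Bag-of-words features over the context + Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_contexts, train_senses)

print(clf.predict(["a loud bass guitar solo"]))   # -> ['music']
```

In practice the training data would be a large sense-tagged corpus and the features would include collocations and wider context windows, but the classification setup is the same.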
Word sense disambiguation

WSD by semi-supervised learning

Yarowsky, David (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of ACL.

Disambiguating plant (factory vs vegetation senses):

1. Find contexts in training corpus:

sense   training example
?       company said that the plant is still operating
?       although thousands of plant and animal species
?       zonal distribution of plant life
?       company manufacturing plant is in Orlando
etc

Yarowsky (1995): schematically

Initial state

[Figure: all occurrences of plant are still unlabelled, shown as a scatter of ? marks]

2. Identify some seeds to disambiguate a few uses:
‘plant life’ for the vegetation use (A)
‘manufacturing plant’ for the factory use (B)

sense   training example
?       company said that the plant is still operating
?       although thousands of plant and animal species
A       zonal distribution of plant life
B       company manufacturing plant is in Orlando
etc

Seeds

[Figure: the seed collocations label small clusters: points near ‘manu.’ become B (factory), points near ‘life’ become A (vegetation); all other points remain ?]

3. Train a decision list classifier on the Sense A / Sense B examples.
Rank features by log-likelihood ratio:

  log ( P(Sense_A | f_i) / P(Sense_B | f_i) )

reliability   criterion                           sense
8.10          plant life                          A
7.58          manufacturing plant                 B
6.27          animal within 10 words of plant     A
etc
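A minimal sketch of this ranking step, assuming we already have counts of how often each feature co-occurs with each sense. The counts below are invented, and the add-alpha smoothing is a simple stand-in rather than Yarowsky's exact estimator:

```python
import math

# Hypothetical feature/sense co-occurrence counts (invented for illustration).
counts = {
    "plant life":                      {"A": 120, "B": 1},
    "manufacturing plant":             {"A": 1,   "B": 85},
    "animal within 10 words of plant": {"A": 60,  "B": 2},
}

def llr(feature, alpha=0.1):
    """Smoothed log-likelihood ratio log P(A | f) / P(B | f)."""
    a = counts[feature]["A"] + alpha
    b = counts[feature]["B"] + alpha
    return math.log(a / b)

# Decision list: features sorted by the absolute value of the ratio,
# each predicting the sense it favours.
decision_list = sorted(
    ((abs(llr(f)), f, "A" if llr(f) > 0 else "B") for f in counts),
    reverse=True,
)
for reliability, feature, sense in decision_list:
    print(f"{reliability:5.2f}  {feature:35s}  {sense}")
```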

4. Apply the classifier to the training set and add reliable examples to the A and B sets.

sense   training example
?       company said that the plant is still operating
A       although thousands of plant and animal species
A       zonal distribution of plant life
B       company manufacturing plant is in Orlando
etc

5. Iterate steps 3 and 4 until convergence.

Iterating:

[Figure: the A and B regions grow outward from the seeds as new examples are labelled; new features such as ‘animal’ (sense A) and ‘company’ (sense B) are picked up along the way]

Final:

[Figure: at convergence nearly every training example is labelled A or B]

6. Apply the classifier to the unseen test data.

- ‘one sense per discourse’ can be used as an additional refinement
- Yarowsky’s experiments were nearly all on homonyms: these principles may not hold as well for sense extension

Problems with WSD as supervised classification

Yarowsky reported an accuracy of 95%, but ...
- on ‘easy’ homonymous examples
- real performance is around 75% (supervised)
- need to predefine word senses (not theoretically sound)
- need a very large training corpus (difficult to annotate; humans do not agree)
- learn a model for individual words — no real generalisation
Better way:
- unsupervised sense induction (but a very hard task)

Distributional hypothesis

“You shall know a word by the company it keeps” (Firth)
“The meaning of a word is defined by the way it is used” (Wittgenstein)

it was authentic scrumpy, rather sharp and very strong
we could taste a famous local product — scrumpy
spending hours in the pub drinking scrumpy
Cornish Scrumpy Medium Dry. £19.28 - Case

Scrumpy

[Figure: photo of scrumpy, a strong West Country cider]

Distributional hypothesis

This leads to the distributional hypothesis about word meaning:
- the context surrounding a given word provides information about its meaning;
- words are similar if they share similar linguistic contexts;
- semantic similarity ≈ distributional similarity.
Models

Distributional semantics

Distributional semantics: a family of techniques for representing word meaning based on (linguistic) contexts of use.

1. Count-based models:
- vector space models
- dimensions correspond to elements in the context
- words are represented as vectors, or higher-order tensors

2. Prediction models:
- train a model to predict plausible contexts for a word
- learn word representations in the process
Count-based models

Count-based approaches: the general intuition

- The semantic space has dimensions which correspond to possible contexts – features.
- For our purposes, a distribution can be seen as a point in that space (the vector being defined with respect to the origin of that space).
- scrumpy [...pub 0.8, drink 0.7, strong 0.4, joke 0.2, mansion 0.02, zebra 0.1...] (a similarity sketch follows below)
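Once words are points in this space, similarity can be measured geometrically, most commonly as the cosine of the angle between vectors. A minimal sketch with made-up vectors over a few hypothetical context dimensions (the cider and castle vectors are invented for contrast):

```python
import math

# Toy context vectors over the dimensions [pub, drink, strong, joke, mansion, zebra],
# loosely following the scrumpy example above.
scrumpy = [0.8, 0.7, 0.4, 0.2, 0.02, 0.1]
cider   = [0.7, 0.8, 0.3, 0.1, 0.01, 0.0]   # hypothetical similar word
castle  = [0.1, 0.0, 0.2, 0.0, 0.9,  0.0]   # hypothetical dissimilar word

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(scrumpy, cider))    # high: similar distributions
print(cosine(scrumpy, castle))   # low: dissimilar distributions
```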

Vectors

[Figure: word vectors plotted in the semantic space]

Feature matrix

         feature1   feature2   ...   featuren
word1    f1,1       f2,1       ...   fn,1
word2    f1,2       f2,2       ...   fn,2
...
wordm    f1,m       f2,m       ...   fn,m

The notion of context

1. Word windows (unfiltered): n words on either side of the lexical item (a counting sketch follows below).
Example: n=2 (5-word window):
| The prime minister acknowledged the | question.
minister [ the 2, prime 1, acknowledged 1, question 0 ]
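A minimal sketch of collecting unfiltered word-window counts for a target word (pure Python; lowercased whitespace tokenisation is an assumption made for brevity):

```python
from collections import Counter

def window_contexts(tokens, target, n=2):
    """Count words appearing within n positions of each occurrence of target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            window = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
            counts.update(window)
    return counts

sentence = "the prime minister acknowledged the question".split()
print(window_contexts(sentence, "minister", n=2))
# -> the: 2, prime: 1, acknowledged: 1  (question falls outside the window)
```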

Context

2. Word windows (filtered): n words on either side, removing some words (e.g. function words, some very frequent content words) via a stop-list or by POS tag.
Example: n=2 (5-word window), stop-list:
| The prime minister acknowledged the | question.
minister [ prime 1, acknowledged 1, question 0 ]

Context

3. Lexeme window (filtered or unfiltered): as above, but using stems.
Example: n=2 (5-word window), stop-list:
| The prime minister acknowledged the | question.
minister [ prime 1, acknowledge 1, question 0 ]

Context

4. Dependencies (directed links between heads and dependents). The context for a lexical item is the dependency structure it belongs to (various definitions; see the sketch below).
Example:
The prime minister acknowledged the question.
minister [ prime_a 1, acknowledge_v 1 ]
minister [ prime_a_mod 1, acknowledge_v_subj 1 ]
minister [ prime_a 1, acknowledge_v+question_n 1 ]
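One way to extract such dependency-based contexts is with an off-the-shelf parser. A minimal sketch using spaCy (assuming the en_core_web_sm model is installed); the relation labels it produces differ from the slide's notation:

```python
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("The prime minister acknowledged the question.")

contexts = Counter()
target = "minister"
for tok in doc:
    if tok.text == target:
        # dependents of the target, e.g. the modifier "prime"
        for child in tok.children:
            contexts[f"{child.lemma_}_{child.dep_}"] += 1
        # the head the target attaches to, e.g. "acknowledge" with the subject relation
        if tok.head is not tok:
            contexts[f"{tok.head.lemma_}_{tok.dep_}"] += 1

print(contexts)
# e.g. Counter({'prime_amod': 1, 'the_det': 1, 'acknowledge_nsubj': 1})
```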

Parsed vs unparsed data: examples

word (unparsed)   word (parsed)
meaning_n         or_c+phrase_n
derive_v          and_c+phrase_n
dictionary_n      syllable_n+of_p
pronounce_v       play_n+on_p
phrase_n          etymology_n+of_p
latin_j           portmanteau_n+of_p
ipa_n             and_c+deed_n
verb_n            meaning_n+of_p
mean_v            from_p+language_n
hebrew_n          pron_rel_+utter_v
usage_n           for_p+word_n
literally_r       in_p+sentence_n

Dependency vectors
word (Subj)   word (Dobj)
come_v        use_v
mean_v        say_v
go_v          hear_v
speak_v       take_v
make_v        speak_v
say_v         find_v
seem_v        get_v
follow_v      remember_v
give_v        read_v
describe_v    write_v
get_v         utter_v
appear_v      know_v
begin_v       understand_v
sound_v       believe_v
occur_v       choose_v

Context weighting

- Binary model: if context c co-occurs with word w, the value of word w's vector for dimension c is 1, 0 otherwise.
  ... [a long long long example for a distributional semantics] model ... (n=4)
  ... {a 1} {dog 0} {long 1} {sell 0} {semantics 1} ...
- Basic frequency model: the value of word w's vector for dimension c is the number of times that c co-occurs with w.
  ... [a long long long example for a distributional semantics] model ... (n=4)
  ... {a 2} {dog 0} {long 3} {sell 0} {semantics 1} ...

Characteristic model
- Weights given to the vector components express how characteristic a given context is for word w.
- Pointwise Mutual Information (PMI); a small computation sketch follows below:

  PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ] = log [ P(w) P(c|w) / (P(w) P(c)) ] = log [ P(c|w) / P(c) ]

  where

  P(c) = f(c) / Σk f(ck),   P(c|w) = f(w, c) / f(w)

  so that

  PMI(w, c) = log [ f(w, c) · Σk f(ck) / (f(w) f(c)) ]

f(w, c): frequency of word w in context c
f(w): frequency of word w in all contexts
f(c): frequency of context c
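A minimal sketch of turning raw co-occurrence counts into PMI (or PPMI) weights, following the count-based definition above; the tiny count table is invented for illustration:

```python
import math

# Hypothetical co-occurrence counts f(w, c) (rows: words, columns: contexts).
f_wc = {
    "scrumpy": {"pub": 8, "drink": 7, "strong": 4, "zebra": 0},
    "zoo":     {"pub": 1, "drink": 1, "strong": 0, "zebra": 6},
}

contexts = next(iter(f_wc.values())).keys()
f_c = {c: sum(row[c] for row in f_wc.values()) for c in contexts}   # f(c)
total = sum(f_c.values())                                           # Σk f(ck)
f_w = {w: sum(row.values()) for w, row in f_wc.items()}             # f(w)

def pmi(w, c):
    if f_wc[w][c] == 0:
        return float("-inf")            # log 0; unseen pairs are usually clipped
    return math.log(f_wc[w][c] * total / (f_w[w] * f_c[c]))

def ppmi(w, c):
    return max(0.0, pmi(w, c)) if f_wc[w][c] else 0.0

print(round(ppmi("scrumpy", "pub"), 2))
print(round(ppmi("zoo", "zebra"), 2))
```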

What semantic space?

- Entire vocabulary.
  + All information included – even rare contexts
  - Inefficient (100,000s of dimensions). Noisy (e.g. 002.png|thumb|right|200px|graph_n). Sparse.
- Top n words with highest frequencies.
  + More efficient (2,000-10,000 dimensions). Only ‘real’ words included.
  - May miss out on infrequent but relevant contexts.

Word frequency: Zipfian distribution

[Figure: Zipfian distribution of word frequencies – a few words are very frequent, most are rare]

What semantic space?

- Singular Value Decomposition (SVD): the number of dimensions is reduced by exploiting redundancies in the data (a small sketch follows below).
  + Very efficient (200-500 dimensions). Captures generalisations in the data.
  - SVD matrices are not interpretable.
- Non-negative matrix factorization (NMF): similar to SVD in spirit, but performs the factorization differently.
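A minimal sketch of reducing a count (or PPMI) matrix with truncated SVD using numpy; the random matrix here is a stand-in for a real word-by-context matrix:

```python
import numpy as np

# Stand-in for an m-by-n word-by-context count/PPMI matrix.
rng = np.random.default_rng(0)
M = rng.random((500, 2000))

# Truncated SVD: keep only the top k singular values/vectors.
k = 100
U, S, Vt = np.linalg.svd(M, full_matrices=False)
word_vectors = U[:, :k] * S[:k]       # k-dimensional word representations

print(word_vectors.shape)             # (500, 100)
```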
Getting distributions from text

Our reference text

Douglas Adams, Mostly harmless


The major difference between a thing that might go wrong and
a thing that cannot possibly go wrong is that when a thing that
cannot possibly go wrong goes wrong it usually turns out to be
impossible to get at or repair.
- Example: produce distributions using a word-window, PMI-based model

The semantic space

Douglas Adams, Mostly harmless


The major difference between a thing that might go wrong and
a thing that cannot possibly go wrong is that when a thing that
cannot possibly go wrong goes wrong it usually turns out to be
impossible to get at or repair.
- Assume we keep only open-class words.
- Dimensions:

difference   impossible   thing
get          major        turns
go           possibly     usually
goes         repair       wrong

Frequency counts...

Douglas Adams, Mostly harmless


The major difference between a thing that might go wrong and
a thing that cannot possibly go wrong is that when a thing that
cannot possibly go wrong goes wrong it usually turns out to be
impossible to get at or repair.
- Counts:

difference 1   impossible 1   thing 3
get 1          major 1        turns 1
go 3           possibly 2     usually 1
goes 1         repair 1       wrong 4

Conversion into 5-word windows...

Douglas Adams, Mostly harmless


The major difference between a thing that might go wrong and
a thing that cannot possibly go wrong is that when a thing that
cannot possibly go wrong goes wrong it usually turns out to be
impossible to get at or repair.
- ∅ ∅ the major difference
- ∅ the major difference between
- the major difference between a
- major difference between a thing
- ...

Distribution for wrong

Douglas Adams, Mostly harmless


The major difference between a thing that [might go wrong and
a] thing that cannot [possibly go wrong is that] when a thing that
cannot [possibly go [wrong goes wrong] it usually] turns out to
be impossible to get at or repair.
- Distribution (frequencies):

difference 0   impossible 0   thing 0
get 0          major 0        turns 0
go 3           possibly 2     usually 1
goes 2         repair 0       wrong 2

Distribution for wrong

Douglas Adams, Mostly harmless


The major difference between a thing that [might go wrong and
a] thing that cannot [possibly go wrong is that] when a thing that
cannot [possibly go [wrong goes wrong] it usually] turns out to
be impossible to get at or repair.
- Distribution (PPMIs); one entry is worked through below:

difference 0   impossible 0    thing 0
get 0          major 0         turns 0
go 0.70        possibly 0.70   usually 0.70
goes 1         repair 0        wrong 0.40
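As a worked check of one entry, assuming the natural logarithm and taking f(wrong) = 10 (the total of the frequency distribution above) and Σk f(ck) = 20 (the total of the corpus counts):

P(go | wrong) = f(wrong, go) / f(wrong) = 3 / 10 = 0.3
P(go) = f(go) / Σk f(ck) = 3 / 20 = 0.15
PPMI(wrong, go) = max(0, log(0.3 / 0.15)) = log 2 ≈ 0.70

The possibly and usually entries come out the same way; the exact values of the remaining entries depend on how the overlapping windows are counted.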

You might also like