Vector Semantics and Embedding (Part 1)

The document discusses the complexities of word meaning, emphasizing the need for a theory that encompasses various linguistic relations such as synonymy, antonymy, and connotation. It introduces vector semantics as a computational model to represent word meanings through embeddings, allowing for better handling of word similarity and context. Additionally, it covers techniques like TF-IDF and PPMI for measuring word relevance and associations in natural language processing.


Word Meaning

Vector Semantics & Embeddings
What do words mean?
N-gram or text classification methods we've seen so far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
Introductory logic classes:
◦ The meaning of "dog" is DOG; cat is CAT
∀x DOG(x) ⟶ MAMMAL(x)
Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
That seems hardly better!
Desiderata
What should a theory of word meaning do for us?
Let's look at some desiderata
From lexical semantics, the linguistic study of word
meaning
Lemmas and senses
lemma: mouse (N)
senses:
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
(Modified from the online thesaurus WordNet)

A sense or “concept” is the meaning component of a word


Lemmas can be polysemous (have multiple senses)
Relations between senses: Synonymy
Synonyms have the same meaning in some or all
contexts.
◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H2O
Relations between senses: Synonymy
Note that there are probably no examples of perfect
synonymy.
◦ Even if many aspects of meaning are identical
◦ Still may differ based on politeness, slang, register,
genre, etc.
Relation: Synonymy?
water/H2O
"H2O" in a surfing guide?
big/large
my big sister != my large sister
The Linguistic Principle of Contrast

Difference in form 🡪 difference in meaning


Abbé Gabriel Girard 1718
Re: "exact" synonyms
"

"

[I do not believe that there


is a synonymous word in any
language]

Thanks to Mark Aronoff!


Relation: Similarity
Words with similar meanings. Not synonyms, but sharing
some element of meaning

car, bicycle
cow, horse
Ask humans how similar 2 words are

word1 word2 similarity


vanish disappear 9.8
behave obey 7.3
belief impression 5.95
muscle bone 3.65
modest flexible 0.98
hole agreement 0.3

SimLex-999 dataset (Hill et al., 2015)


Relation: Word relatedness
Also called "word association"
Words can be related in any way, perhaps via a semantic
frame or field

◦ coffee, tea: similar


◦ coffee, cup: related, not similar
Semantic field
Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.

hospitals
surgeon, scalpel, nurse, anaesthetic, hospital
restaurants
waiter, menu, plate, food, chef
houses
door, roof, kitchen, family, bed
Relation: Antonymy
Senses that are opposites with respect to only one
feature of meaning
Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
◦ long/short, fast/slow
◦ Be reversives:
◦ rise/fall, up/down
Connotation (sentiment)

• Words have affective meanings


• Positive connotations (happy)
• Negative connotations (sad)
• Connotations can be subtle:
• Positive connotation: copy, replica, reproduction
• Negative connotation: fake, knockoff, forgery
• Evaluation (sentiment!)
• Positive evaluation (great, love)
• Negative evaluation (terrible, hate)
Connotation
Osgood et al. (1957)

Words seem to vary along 3 affective dimensions:


◦ valence: the pleasantness of the stimulus
◦ arousal: the intensity of emotion provoked by the stimulus
◦ dominance: the degree of control exerted by the stimulus
           Word        Score   Word       Score
Valence    love        1.000   toxic      0.008
           happy       1.000   nightmare  0.005
Arousal    elated      0.960   mellow     0.069
           frenzy      0.965   napping    0.046
Dominance  powerful    0.991   weak       0.045
           leadership  0.983   empty      0.081

Values from NRC VAD Lexicon (Mohammad 2018)


So far
Concepts or word senses
◦ Have a complex many-to-many association with words
(homonymy, multiple senses)
Have relations with each other
◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation
Word Meaning
Vector Semantics & Embeddings

Vector Semantics
Vector Semantics & Embeddings
Computational models of word meaning

Can we build a theory of how to represent word meaning that accounts for at least some of the desiderata?
We'll introduce vector semantics
The standard model in language processing!
Handles many of our goals!
Ludwig Wittgenstein

PI #43:
"The meaning of a word is its use in the language"
Let's define words by their usages
One way to define "usage":
words are defined by their environments (the words around them)

Zellig Harris (1954):


If A and B have almost identical environments we say that they
are synonyms.
What does recent English borrowing ongchoi
mean?
Suppose you see these sentences:
• Ong choi is delicious sautéed with garlic.
• Ong choi is superb over rice
• Ong choi leaves with salty sauces
And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard greens
◦ We could conclude this based on words like "leaves" and "delicious" and "sauteed"
Ongchoi: Ipomoea aquatica "Water Spinach"

空心菜
kangkong
rau muống

Yamaguchi, Wikimedia Commons, public domain


Idea 1: Defining meaning by linguistic distribution

Let's define the meaning of a word by its distribution in language use, meaning its neighboring words or grammatical environments.
Idea 2: Meaning as a point in space (Osgood et al.
1957)
3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted
           Word        Score   Word       Score
Valence    love        1.000   toxic      0.008
           happy       1.000   nightmare  0.005
Arousal    elated      0.960   mellow     0.069
           frenzy      0.965   napping    0.046
Dominance  powerful    0.991   weak       0.045
           leadership  0.983   empty      0.081

Values from NRC VAD Lexicon (Mohammad 2018)

Hence the connotation of a word is a vector in 3-space


Idea 1: Defining meaning by linguistic distribution

Idea 2: Meaning as a point in multidimensional space


Defining meaning as a point in space based on distribution
Each word = a vector (not just "good" or "w45")
Similar words are "nearby in semantic space"
We build this space automatically by seeing which words are
nearby in text
We define the meaning of a word as a vector
Called an "embedding" because it's embedded into a
space (see textbook)
The standard way to represent meaning in NLP
Every modern NLP algorithm uses embeddings as
the representation of word meaning
Fine-grained model of meaning for similarity
Intuition: why vectors?
Consider sentiment analysis:
◦ With words, a feature is a word identity
◦ Feature 5: 'The previous word was "terrible"'
◦ requires exact same word to be in training and test
◦ With embeddings:
◦ Feature is a word vector
◦ 'The previous word was vector [35,22,17…]'
◦ Now in the test set we might see a similar vector [34,21,14]
◦ We can generalize to similar but unseen words!!!
We'll discuss 2 kinds of
embeddings
tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of nearby
words
Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings
From now on:
Computing with meaning representations
instead of string representations
Vector Semantics
Vector Semantics & Embeddings

Words and Vectors
Vector Semantics & Embeddings
Term-document matrix
Each document is represented by a vector of words
Visualizing document vectors
Vectors are the basis of information retrieval

Vectors are similar for the two comedies

But comedies are different than the other two


Comedies have more fools and wit and fewer battles.
Idea for word meaning: Words can be vectors
too!!!

battle is "the kind of word that occurs in Julius Caesar and Henry V"

fool is "the kind of word that occurs in comedies, especially Twelfth Night"
More common: word-word matrix
(or "term-context matrix")
Two words are similar in meaning if their context vectors are similar
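
To make this concrete, here is a minimal sketch of building a word-word (term-context) co-occurrence matrix. The toy corpus and the ±2-word context window are assumptions for illustration, not taken from the slides.

from collections import defaultdict

corpus = [
    "ong choi is delicious sauteed with garlic",
    "spinach sauteed with garlic over rice",
    "chard stems and leaves are delicious",
]

window = 2                                       # +/- 2-word context window
counts = defaultdict(lambda: defaultdict(int))   # counts[target][context]

for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1   # row = target word, column = context word

# Words with similar rows (context vectors) are distributionally similar:
print(dict(counts["delicious"]))
print(dict(counts["sauteed"]))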
Words and Vectors
Vector Semantics & Embeddings

Cosine for computing word similarity
Vector Semantics & Embeddings
Computing word similarity: Dot product and
cosine
The dot product between two vectors is a scalar:

dot product(v, w) = v · w = v1w1 + v2w2 + … + vNwN
The dot product tends to be high when the two vectors have large values in the same dimensions.
Dot product can thus be a useful similarity metric between vectors.
Problem with raw dot-product
Dot product favors long vectors
Dot product is higher if a vector is longer (has higher values in many dimensions)
Vector length:

|v| = sqrt( v1² + v2² + … + vN² )
Frequent words (of, the, you) have long vectors (since they occur many times with other words).
So dot product overly favors frequent words.
Alternative: cosine for computing word similarity

Based on the definition of the dot product between two vectors a and b:

cos(a, b) = (a · b) / ( |a| |b| )
Cosine as a similarity metric

-1: vectors point in opposite directions


+1: vectors point in same directions
0: vectors are orthogonal

But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0 to 1.

Cosine examples
             pie    data   computer
cherry       442       8          2
digital        5    1683       1670
information    5    3982       3325
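
A minimal sketch of cosine similarity over the count vectors in the table above; the function is a generic implementation, not code from the slides. Cherry and information should come out very dissimilar, digital and information very similar.

import math

vectors = {
    "cherry":      [442, 8, 2],        # contexts: pie, data, computer
    "digital":     [5, 1683, 1670],
    "information": [5, 3982, 3325],
}

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    len_v = math.sqrt(sum(vi * vi for vi in v))
    len_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (len_v * len_w)

print(cosine(vectors["cherry"], vectors["information"]))   # small value: few shared contexts
print(cosine(vectors["digital"], vectors["information"]))  # close to 1: very similar context vectors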
Visualizing cosines
(well, angles)
Cosine for computing word similarity
Vector Semantics & Embeddings

TF-IDF
Vector Semantics & Embeddings
But raw frequency is a bad representation

• The co-occurrence matrices we have seen represent each cell by word frequencies.
• Frequency is clearly useful; if sugar appears a lot near apricot, that's useful information.
• But overly frequent words like the, it, or they are not very informative about the context
• It's a paradox! How can we balance these two conflicting
constraints?
Two common solutions for word weighting

tf-idf: weight each cell by the word's tf-idf value
◦ Words like "the" or "it" have very low idf

PMI (pointwise mutual information)
◦ See if words like "good" appear more often with "great" than we would expect by chance
Term frequency (tf)
tft,d = count(t,d)

Instead of using raw count, we squash a bit:

tft,d = log10(count(t,d)+1)
Document frequency (df)
dft is the number of documents t occurs in.
(note this is not collection frequency: total count across
all documents)
"Romeo" is very distinctive for one Shakespeare play:
Inverse document frequency (idf)

idft = log10( N / dft )

N is the total number of documents in the collection
What is a document?

Could be a play or a Wikipedia article


But for the purposes of tf-idf, documents can be
anything; we often call each paragraph a document!
Final tf-idf weighted value for a word

wt,d = tft,d × idft

(Example term-document matrix: raw counts vs. tf-idf weighted values)
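
A minimal sketch of the tf-idf weighting defined above (tf = log10(count + 1), idf = log10(N / df)), using an assumed toy corpus of three tiny "documents"; the corpus and function names are illustrative, not from the slides.

import math
from collections import Counter

docs = [
    "the fool and the wit".split(),
    "the battle and the king".split(),
    "the fool saw the battle".split(),
]

N = len(docs)
df = Counter()                       # document frequency: number of docs a term occurs in
for doc in docs:
    for term in set(doc):
        df[term] += 1

def tf_idf(term, doc):
    tf = math.log10(doc.count(term) + 1)   # squashed term frequency
    idf = math.log10(N / df[term])         # inverse document frequency
    return tf * idf

print(tf_idf("fool", docs[0]))   # distinctive word: positive weight
print(tf_idf("the", docs[0]))    # occurs in every doc, so idf = log10(3/3) = 0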
TF-IDF
Vector Semantics & Embeddings

PPMI
Vector Semantics & Embeddings
Pointwise Mutual Information

Do a word w and a context c co-occur more than if they were independent?

PMI(w, c) = log2( P(w, c) / ( P(w) P(c) ) )

Positive Pointwise Mutual Information

PPMI(w, c) = max( PMI(w, c), 0 )

(Negative PMI values are unreliable without enormous corpora, so they are replaced by 0.)
Computing PPMI on a term-context
matrix
Matrix F with W rows (words) and C columns (contexts)
fij = number of times word wi occurs in context cj

pij = fij / Σi Σj fij        p(wi) = Σj fij / Σi Σj fij        p(cj) = Σi fij / Σi Σj fij

pmiij = log2( pij / ( p(wi) p(cj) ) )        ppmiij = max( pmiij, 0 )

Worked example:
p(w=information, c=data) = 3982/11716 = .3399
p(w=information) = 7703/11716 = .6575
p(c=data) = 5673/11716 = .4842

pmi(information, data) = log2( .3399 / (.6575 × .4842) ) = .0944

Resulting PPMI matrix (negatives replaced by 0)

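A minimal sketch of the PPMI computation, reusing the small cherry/digital/information count table from the cosine example as matrix F (the worked numbers above come from a larger matrix not shown here); this is an illustrative implementation, not code from the slides.

import numpy as np

# Count matrix F: rows = words (cherry, digital, information),
# columns = contexts (pie, data, computer).
F = np.array([
    [442.0,    8.0,    2.0],
    [  5.0, 1683.0, 1670.0],
    [  5.0, 3982.0, 3325.0],
])

total = F.sum()
p_wc = F / total                         # joint p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)    # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)    # marginal p(c)

with np.errstate(divide="ignore"):       # log2(0) for zero counts becomes -inf
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                # replace negative (and -inf) values with 0

print(np.round(ppmi, 2))
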
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities
◦ Use add-one smoothing (which has a similar effect)

Weighting PMI: Giving rare context words slightly higher probability

Raise the context counts to the power α (α = 0.75 works well in practice):

PPMIα(w, c) = max( log2( P(w, c) / ( P(w) Pα(c) ) ), 0 )

Pα(c) = count(c)^α / Σc′ count(c′)^α

This increases the probability assigned to rare contexts, reducing PMI's bias toward them.
