Vector Semantics and Embeddings (Part 1)
What do words mean?
N-gram or text classification methods we've seen so far
◦ Words are just strings (or indices w_i in a vocabulary list)
◦ That's not very satisfactory!
Introductory logic classes:
◦ The meaning of "dog" is DOG; cat is CAT
∀x DOG(x) ⟶ MAMMAL(x)
Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
That hardly seems better!
Desiderata
What should a theory of word meaning do for us?
Let's look at some desiderata
From lexical semantics, the linguistic study of word meaning
Lemmas and senses
lemma
mouse (N)
sense
1. any of numerous small rodents...
2. a hand-operated device that controls
a cursor... Modified from the online thesaurus WordNet
"
Words with similar meanings:
◦ car, bicycle
◦ cow, horse
Ask humans how similar 2 words are.
Words grouped by topic:
◦ hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
◦ restaurants: waiter, menu, plate, food, chef
◦ houses: door, roof, kitchen, family, bed
Relation: Antonymy
Senses that are opposites with respect to only one
feature of meaning
Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
  (long/short, fast/slow)
◦ be reversives
  (rise/fall, up/down)
Connotation (sentiment)
Wittgenstein, Philosophical Investigations (PI) §43:
"The meaning of a word is its use in the language"
Let's define words by their usages
One way to define "usage":
words are defined by their environments (the words around them)
空心菜 (Chinese: "water spinach")
kangkong (Tagalog/Malay)
rau muống (Vietnamese)
…
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"
More common: word-word matrix (or "term-context matrix")
Two words are similar in meaning if their context vectors are similar
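As a concrete illustration, here is a minimal sketch (not from the lecture) that builds a word-word co-occurrence matrix from a toy corpus with a ±2-word context window; the corpus, window size, and function name are all illustrative assumptions.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """For each word, count how often every other word appears
    within +/- `window` positions of it (a word-word matrix)."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus (made up for illustration)
sentences = [
    "sugar is a sweet substance".split(),
    "the pineapple is sweet and yellow".split(),
    "data and information are processed by a computer".split(),
]
matrix = cooccurrence_counts(sentences, window=2)
print(matrix["sweet"])   # context counts for the word "sweet"
```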
Words and Vectors
Cosine for computing word similarity
Computing word similarity: dot product and cosine
The dot product between two vectors is a scalar:
  v · w = Σ_i v_i w_i = v_1 w_1 + v_2 w_2 + … + v_N w_N
Based on the definition of the dot product between two vectors a and b
(a · b = |a| |b| cos θ), the cosine of the angle between them is
  cos(a, b) = (a · b) / (|a| |b|)
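A minimal sketch of these two formulas; the context-count vectors below are made-up examples, not values from the lecture.

```python
import math

def dot(v, w):
    # Dot product: sum of elementwise products (a scalar)
    return sum(vi * wi for vi, wi in zip(v, w))

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| |w|)
    return dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))

# Made-up context-count vectors over three context words
cherry      = [442, 8, 2]
digital     = [5, 1683, 1670]
information = [5, 3982, 3325]

print(round(cosine(cherry, digital), 3))        # low: very different contexts
print(round(cosine(digital, information), 3))   # high: similar contexts
```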
Cosine as a similarity metric
◦ −1: vectors point in opposite directions; +1: same direction; 0: orthogonal
◦ Raw frequency (and tf-idf, PPMI) values are non-negative, so the cosine for these vectors ranges from 0 to 1
Visualizing cosines (well, angles)
TF-IDF
But raw frequency is a bad representation
◦ Frequency is useful, but overly frequent words like "the" or "good" are not very informative
◦ Instead, squash raw counts with a log (term frequency):
  tf_{t,d} = log10(count(t, d) + 1)
Document frequency (df)
◦ df_t is the number of documents t occurs in
◦ Note: this is not collection frequency (the total count across all documents)
◦ "Romeo" is very distinctive for one Shakespeare play: it occurs in only that play, so its df is 1
Inverse document frequency (idf)
  idf_t = log10(N / df_t)   (N = total number of documents in the collection)
tf-idf: the product of the two
  w_{t,d} = tf_{t,d} × idf_t
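A small sketch of this tf-idf weighting on a hypothetical term-document count table; the counts, document names, and N = 37 are illustrative assumptions.

```python
import math

# Hypothetical counts: {term: {document: raw count}}
counts = {
    "romeo":  {"Romeo and Juliet": 113},
    "action": {"Romeo and Juliet": 4, "Hamlet": 3, "Macbeth": 5},
    "good":   {"Romeo and Juliet": 80, "Hamlet": 62, "Macbeth": 93},
}
N = 37   # assumed total number of documents in the collection

def tf(term, doc):
    # tf_{t,d} = log10(count(t, d) + 1)
    return math.log10(counts[term].get(doc, 0) + 1)

def idf(term):
    # idf_t = log10(N / df_t), where df_t = number of documents containing t
    return math.log10(N / len(counts[term]))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf_idf("romeo", "Romeo and Juliet"), 2))   # high: df = 1
print(round(tf_idf("good", "Romeo and Juliet"), 2))    # lower, despite a high raw count
```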
PPMI
Pointwise Mutual Information
  PMI(w, c) = log2( P(w, c) / (P(w) P(c)) )
◦ Do word w and context c co-occur more often than if they were independent?
Positive Pointwise Mutual Information
  PPMI(w, c) = max( PMI(w, c), 0 )
◦ Negative PMI values are hard to estimate reliably, so they are replaced by 0
Computing PPMI on a term-context matrix
Matrix F with W rows (words) and C columns (contexts)
◦ f_ij is the number of times word w_i occurs in context c_j
◦ p_ij = f_ij / Σ_i Σ_j f_ij,   p(w_i) = Σ_j p_ij,   p(c_j) = Σ_i p_ij
p(w=information, c=data) = 3982/11716 = .3399
p(w=information) = 7703/11716 = .6575
p(c=data) = 5673/11716 = .4842
pmi(information,data) = log2 (.3399 / (.6575*.4842) ) = .0944
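A minimal sketch that redoes this arithmetic directly from the counts; the function name is my own, the counts are the ones used above.

```python
import math

def ppmi(count_wc, count_w, count_c, total):
    # p(w,c), p(w), p(c) are maximum-likelihood estimates from the counts
    p_wc = count_wc / total
    p_w  = count_w / total
    p_c  = count_c / total
    pmi = math.log2(p_wc / (p_w * p_c))
    return max(pmi, 0.0)   # PPMI clips negative PMI values to zero

# Counts from the worked example above: count(information, data) = 3982,
# count(information) = 7703, count(data) = 5673, total = 11716
print(round(ppmi(3982, 7703, 5673, 11716), 4))   # 0.0944
```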
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities
◦ Use add-one smoothing (which has a similar effect)
Weighting PMI: giving rare context words slightly higher probability
  P_α(c) = count(c)^α / Σ_c′ count(c′)^α,  with α = 0.75
◦ Raising counts to a power α < 1 increases the relative probability of rare contexts, which lowers their PMI
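A minimal sketch of this re-weighting; the context counts below are made up, and α = 0.75 follows the formula above.

```python
def smoothed_context_probs(context_counts, alpha=0.75):
    """P_alpha(c) = count(c)^alpha / sum_c' count(c')^alpha.
    Raising counts to a power alpha < 1 shifts probability mass
    toward rare contexts, which lowers their PMI."""
    powered = {c: n ** alpha for c, n in context_counts.items()}
    total = sum(powered.values())
    return {c: p / total for c, p in powered.items()}

# Made-up counts: one very frequent context word, one rare one
counts = {"the": 1000, "anaesthetic": 10}
print(smoothed_context_probs(counts, alpha=1.0))    # raw (unsmoothed) probabilities
print(smoothed_context_probs(counts, alpha=0.75))   # rare context gets a larger share
```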