2 Vector Semantics
Zellig Harris, “Distributional Structure” (1954)
Ludwig Wittgenstein, Philosophical Investigations (1953)
everyone likes ______________
• Words that appear in similar contexts have similar representations (and similar
meanings, by the distributional hypothesis).
[Term-document matrix: rows are terms (knife, dog, sword, love, like) and columns are eight Shakespeare plays (Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear); each cell holds the count of the term in that play, e.g. the "like" row is 75, 38, 34, 36, 34, 41, 27, 44.]
• We can calculate the cosine similarity of two vectors to judge the degree of
their similarity [Salton 1971]
[The same term-document matrix, used to illustrate cosine similarity; as a sanity check, cos(knife, knife) = 1.]
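As a quick illustration, cosine similarity is just the normalized dot product; a minimal sketch in numpy (the two count vectors below are illustrative, not exact rows of the table):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors: u.v / (|u| |v|)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Illustrative count vectors (not the exact rows of the matrix above)
knife = np.array([1, 1, 4, 2, 2, 10], dtype=float)
sword = np.array([2, 2, 7, 5, 5, 17], dtype=float)

print(cosine(knife, knife))  # 1.0 -- every vector has cosine 1 with itself
print(cosine(knife, sword))  # high, since the two rows are roughly proportional
```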
tf-idf(t, d) = tf_{t,d} × log(N / D_t)

where tf_{t,d} is the count of term t in document d, N is the number of documents, and D_t is the number of documents containing t; the log term is the IDF.
[Table: the term counts across the eight plays (Hamlet, Macbeth, Romeo & Juliet, Richard III, Julius Caesar, Tempest, Othello, King Lear), now with each term's IDF in the final column: knife 0.12, dog 0.20, sword 0.12, like 0 (a word like "like" that appears in every play has IDF 0).]
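A minimal sketch of this weighting, assuming a base-10 log for the IDF term; the count matrix is illustrative (the zero positions for "knife" are assumptions, not the exact table):

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
counts = np.array([
    [1, 0, 1, 4, 2, 2, 0, 2],           # knife (zero positions assumed)
    [75, 38, 34, 36, 34, 41, 27, 44],   # like
], dtype=float)

N = counts.shape[1]                # number of documents
df = (counts > 0).sum(axis=1)      # D_t: documents containing each term
idf = np.log10(N / df)             # IDF (base-10 log here; the base is a convention)
tfidf = counts * idf[:, None]      # tf-idf(t, d) = tf_{t,d} * log(N / D_t)

print(idf)     # "like" occurs in every document, so its IDF (and tf-idf) is 0
print(tfidf)
```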
PPMI(w, c) = max( log2 [ P(w, c) / (P(w) P(c)) ], 0 )
[Table: the same term-document counts with row totals: knife 12, dog 28, sword 57, love 322, like 329; the counts sum to 748 overall, the Romeo & Juliet column sums to 186, and "love" occurs 135 times in Romeo & Juliet.]

PMI(love, R&J) = log2 [ (135/748) / ( (322/748) × (186/748) ) ] ≈ 0.75
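The PPMI computation can be vectorized directly from a count matrix; a minimal sketch with illustrative counts:

```python
import numpy as np

# Toy count matrix: rows = words, columns = contexts (illustrative numbers).
counts = np.array([
    [0., 135., 64., 63., 12., 48.],
    [12., 2., 17., 5., 5., 17.],
])

total = counts.sum()
p_wc = counts / total                      # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal P(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))      # PMI(w, c); zero counts give -inf
ppmi = np.maximum(pmi, 0.0)                # max(PMI, 0) clips negatives (and -inf) to 0

print(np.round(ppmi, 2))
```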
Term-context matrix
• Rows and columns are both words; cell (i, j) counts the number of times words w_i and w_j show up in the same context (e.g., within a window of 2 tokens).
[Example term-context matrix over a dataset: rows are the terms "dog" and "cat", columns are context words; e.g. dog: 2 1 1 1 …, cat: 2 0 1 1 …]

[Example with directional contexts (window = 2): columns are positioned context phrases such as "L: the big", "L: the small", "L: the yellow", "R: ate dinner"; dog: 1 1 0 0 …, cat: 0 1 1 1 …]
• Each cell counts the number of times a directional context phrase appears in a specific position around the term.
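A term-context matrix of this kind can be built with a simple sliding window; a minimal sketch over a toy corpus, using a symmetric window of 2 and ignoring the left/right distinction shown above:

```python
from collections import Counter, defaultdict

# Toy corpus; in practice this would be a large collection of tokenized text.
corpus = [
    "the big dog ate dinner".split(),
    "the small cat ate dinner".split(),
    "the yellow cat sat on the mat".split(),
]

window = 2
cooc = defaultdict(Counter)   # cooc[term][context word] = count

for sent in corpus:
    for i, term in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[term][sent[j]] += 1

print(cooc["dog"])   # context words of "dog" within +/-2 tokens
print(cooc["cat"])
```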
[Figure: the phrases "write a book" and "write a poem", with the target word highlighted and its neighboring words serving as contexts.]
possibly : impossibly :: certain : uncertain
generating : generated :: shrinking : shrank
think : thinking :: look : looking
Baltimore : Maryland :: Oakland : California
shrinking : shrank :: slowing : slowed
Rabat : Morocco :: Astana : Kazakhstan
Sparse vectors

[Figure: a one-hot vector for "aardvark" over the entire vocabulary (a, aa, aal, aalii, aam, Aani, aardvark, aardwolf, …, zymotoxic, zymurgy, Zyrenian, Zyrian): 1 in the "aardvark" dimension, 0 everywhere else.]

Dense vectors

[Figure: a low-dimensional, real-valued vector, e.g. → [0.7, 1.3, -4.5].]
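To make the contrast concrete, a small sketch of the two kinds of representation (the tiny stand-in vocabulary and the dense values are illustrative):

```python
import numpy as np

vocab = ["a", "aa", "aardvark", "aardwolf", "zymurgy", "Zyrian"]  # tiny stand-in vocabulary

# Sparse one-hot vector: dimensionality = |V|, exactly one nonzero entry.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("aardvark")] = 1

# Dense vector: low-dimensional, real-valued, learned from data.
dense = np.array([0.7, 1.3, -4.5])

print(one_hot)   # [0. 0. 1. 0. 0. 0.]
print(dense)
```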
Singular value decomposition
• Any n⨉p matrix X can be decomposed into the product of three matrices, X = U Σ Vᵀ (where m = the number of linearly independent rows, i.e. the rank, and Σ is an m⨉m diagonal matrix of singular values).
[Figure: the three factor matrices, with the singular values on the diagonal of the middle matrix; in the lower panels, all but the two largest singular values are zeroed out, giving a low-rank approximation of X.]
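The truncation in the figure (keeping only the largest singular values) can be sketched with numpy's SVD; the matrix here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(5, 8)).astype(float)   # an arbitrary n x p matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(s) Vt

k = 2                                                # keep the k largest singular values
s_trunc = np.zeros_like(s)
s_trunc[:k] = s[:k]                                  # zero out the rest, as in the figure
X_k = U @ np.diag(s_trunc) @ Vt                      # best rank-k approximation of X

print(np.round(s, 2))
print(np.round(np.linalg.norm(X - X_k), 2))          # reconstruction error
```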
[Figure: the term-document count matrix (rows: knife, dog, sword, love, like) factored into a low-dimensional representation for terms (here 2-dim) and a low-dimensional representation for documents (here 2-dim).]
Latent semantic analysis
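LSA applies exactly this truncated SVD to a term-document count matrix; a minimal sketch (the counts are illustrative stand-ins, not the exact table above):

```python
import numpy as np

terms = ["knife", "dog", "sword", "love", "like"]
# Illustrative 5 x 8 term-document counts (zero positions assumed).
X = np.array([
    [1, 1, 4, 2, 2, 2, 0, 0],
    [2, 6, 6, 2, 12, 0, 0, 0],
    [17, 2, 7, 12, 2, 17, 0, 0],
    [64, 135, 63, 12, 48, 0, 0, 0],
    [75, 38, 34, 36, 34, 41, 27, 44],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]               # 2-dim representation for each term
doc_vectors = (Vt[:k, :] * s[:k, None]).T     # 2-dim representation for each document

for t, v in zip(terms, term_vectors):
    print(t, np.round(v, 2))
```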
[Figure: the phrase "a cocktail with gin and seltzer" with a window size of 3, highlighting the co-occurring pairs "gin and", "gin seltzer", and "and seltzer".]
Dimensionality reduction
[Figure: dimensionality reduction of a one-hot vocabulary vector (the = 1; a, an, for, in, on, dog, cat, … = 0) into a low-dimensional dense vector, e.g. [4.1, -0.9].]
[Figure: a one-hidden-layer network: input units x1, x2, x3 (gin, cocktail, globe), hidden units h1, h2, and output units y_gin, y_cocktail, y_globe, connected by weight matrices W (input → hidden) and V (hidden → output).]
• Can you predict the output word from a vector representation of the
input word?
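A minimal sketch of this idea in plain numpy: predict a context word from an input word through a low-dimensional hidden layer. This is the skip-gram intuition rather than a full word2vec implementation, and the toy vocabulary and training pairs are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gin", "cocktail", "globe", "seltzer", "and"]
idx = {w: i for i, w in enumerate(vocab)}
pairs = [("gin", "cocktail"), ("gin", "seltzer"), ("gin", "and"),
         ("cocktail", "gin"), ("seltzer", "gin")]   # (input word, context word)

V, d = len(vocab), 2
W = 0.1 * rng.standard_normal((V, d))    # input-word embeddings (x -> h)
Wp = 0.1 * rng.standard_normal((d, V))   # hidden-to-output weights (h -> y)
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(200):
    for w_in, w_out in pairs:
        i, o = idx[w_in], idx[w_out]
        h = W[i]                          # hidden layer = embedding of the input word
        y = softmax(h @ Wp)               # predicted distribution over the vocabulary
        err = y.copy(); err[o] -= 1       # gradient of cross-entropy wrt the logits
        grad_h = Wp @ err                 # gradient wrt the hidden layer
        Wp -= lr * np.outer(h, err)       # update output weights
        W[i] -= lr * grad_h               # update the input-word embedding

print(np.round(W, 2))                     # learned 2-dim word representations
```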
https://nlp.stanford.edu/projects/glove/
[Figure: a two-dimensional plot of word vectors in which dog, cat, and puppy appear close together, as do wrench and screwdriver.]
• Why this behavior? "dog" and "cat" show up in similar positions (contexts), so to make the same predictions about their neighboring words, their vector representations need to be close to each other.
“Word embedding” in NLP papers
[Chart: the proportion of NLP papers mentioning "word embedding" by year, 2001-2019 (y-axis from 0 to 0.7).]
• Mikolov et al. 2013 show that vector representations have some potential for
analogical reasoning through vector arithmetic.
Mikolov et al. (2013), “Linguistic Regularities in Continuous Space Word Representations” (NAACL)
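The analogy arithmetic itself is vector addition followed by a nearest-neighbor search under cosine similarity; a sketch over pretrained GloVe vectors (the load_glove helper and the file name are assumptions — any file in GloVe's text format, one word followed by its values per line, would work):

```python
import numpy as np

def load_glove(path, limit=100000):
    """Read GloVe's text format: one word followed by its vector values per line."""
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f):
            if n >= limit:
                break
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.array(parts[1:], dtype=float)
    return vecs

def analogy(vecs, a, b, c, topn=5):
    """Return words d such that a : b :: c : d, i.e. words closest to (b - a + c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    target /= np.linalg.norm(target)
    scores = []
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        scores.append((np.dot(target, v / np.linalg.norm(v)), w))
    return sorted(scores, reverse=True)[:topn]

# Hypothetical path to a downloaded GloVe file (see the URL above).
vecs = load_glove("glove.6B.100d.txt")
print(analogy(vecs, "man", "king", "woman"))   # "queen" is typically near the top
```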
Bias
Blodgett et al. (2020), “Language (Technology) is Power: A Critical Survey of “Bias” in NLP”
Representations
Kiritchenko and Mohammad (2018), "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems"
Interrogating “bias”
• Kozlowski et al. (2019), “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings,” American Sociological Review.
• Lets your representation of the input share statistical strength with words
that behave similarly in terms of their distributional properties (often
synonyms or words that belong to the same class).
Two kinds of “training” data
• The labeled data for a specific task (e.g., labeled sentiment for movie
reviews): ~ 2K labels/reviews, ~1.5M words → used to train a supervised
model
• General text (Wikipedia, the web, books, etc.), ~ trillions of words → used to train distributed word representations
Using dense vectors
• Can also take the derivative of the loss function with respect to those
representations to optimize for a particular task.
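One concrete way to see this: in a framework like PyTorch, the embedding matrix is just another parameter, so gradients of the task loss flow into it when it is not frozen. A minimal sketch with random stand-ins for pretrained word2vec/GloVe vectors and a toy classification step:

```python
import torch
import torch.nn as nn

vocab_size, dim, n_classes = 1000, 50, 2

# Pretend these came from word2vec/GloVe; here they are random stand-ins.
pretrained = torch.randn(vocab_size, dim)

emb = nn.Embedding.from_pretrained(pretrained, freeze=False)  # freeze=False => fine-tuned
clf = nn.Linear(dim, n_classes)
opt = torch.optim.Adam(list(emb.parameters()) + list(clf.parameters()), lr=1e-3)

# One toy training step: average the word vectors of a document, classify, backprop.
doc = torch.randint(0, vocab_size, (20,))   # 20 token ids
label = torch.tensor([1])

logits = clf(emb(doc).mean(dim=0, keepdim=True))
loss = nn.functional.cross_entropy(logits, label)
loss.backward()                             # the gradient also reaches the embeddings
opt.step()
```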
emoji2vec
Eisner et al. (2016), “emoji2vec: Learning Emoji Representations from their Description”
node2vec
Grover and Leskovec (2016), “node2vec: Scalable Feature Learning for Networks”
Trained embeddings
• Word2vec
https://code.google.com/archive/p/word2vec/
• GloVe
http://nlp.stanford.edu/projects/glove/
HW1 out today