Vector Semantics & Embeddings
What do words mean?
N-gram or text classification methods we've seen so far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
Introductory logic classes:
◦ The meaning of "dog" is DOG; cat is CAT
∀x DOG(x) ⟶ MAMMAL(x)
Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
That seems hardly better!
Desiderata
What should a theory of word meaning do for us?
Let's look at some desiderata
From lexical semantics, the linguistic study of word
meaning
Lemmas and senses
lemma: mouse (N)
senses:
  1. any of numerous small rodents...
  2. a hand-operated device that controls a cursor...
(Modified from the online thesaurus WordNet)
car, bicycle
cow, horse
Ask humans how similar 2 words are
hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
restaurants: waiter, menu, plate, food, chef
houses: door, roof, kitchen, family, bed
Relation: Antonymy
Senses that are opposites with respect to only one
feature of meaning
Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
◦ long/short, fast/slow
◦ Be reversives:
◦ rise/fall, up/down
Connotation (sentiment)
Wittgenstein, Philosophical Investigations (PI) §43:
"The meaning of a word is its use in the language"
Let's define words by their usages
One way to define "usage":
words are defined by their environments (the words around them)
空心菜 (Chinese), kangkong (Tagalog/Malay), rau muống (Vietnamese), …
(different languages' names for the same leafy vegetable, water spinach)
The asphalt that Los Angeles is famous for occurs mainly on its freeways. But in the middle of the city is another patch of asphalt, the La Brea tar pits, and this asphalt preserves millions of fossil bones from the last of the Ice Ages of the Pleistocene…
Vector Semantics
Words and Vectors
Term-document matrix
(The vectors here are shown with only four dimensions just so they fit on the page; in real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size.)
Each document is represented by a vector of words.

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0               7            13
good              114              80              62            89
fool               36              58               1             4
wit                20              15               2             3

Figure 6.2: The term-document matrix for four words in four Shakespeare plays. Each cell contains the number of times the (row) word occurs in the (column) document; each document is represented as a column vector of length four.

The ordering of the numbers in a vector space indicates different meaningful dimensions on which documents vary. Thus the first dimension for both these vectors corresponds to the number of times the word battle occurs, and we can compare each dimension, noting for example that the vectors for As You Like It and Twelfth Night have similar values (1 and 0, respectively) for the first dimension.

To review some basic linear algebra, a vector is, at heart, just a list or array of numbers. So As You Like It is represented as the count vector [1,114,36,20] (the first column of Fig. 6.2) and Julius Caesar as [7,62,1,2] (the third column).

We can think of the vector for a document as a point in |V|-dimensional space; the documents in Fig. 6.2 are points in 4-dimensional space. Since 4-dimensional spaces are hard to visualize, Fig. 6.4 shows a visualization in two dimensions; we've arbitrarily chosen the dimensions corresponding to the words battle and fool.
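Not from the slides, but as a concrete sketch: the matrix above as a numpy array, with documents as column vectors and words as row vectors (the variable names are mine):

```python
# Sketch: the term-document matrix of Fig. 6.2 as a numpy array.
# Rows are words, columns are documents; a document is a column vector.
import numpy as np

words = ["battle", "good", "fool", "wit"]
docs = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]

# counts[i, j] = how often words[i] occurs in docs[j]
counts = np.array([
    [  1,   0,   7,  13],   # battle
    [114,  80,  62,  89],   # good
    [ 36,  58,   1,   4],   # fool
    [ 20,  15,   2,   3],   # wit
])

# Each document is a column vector of length |V| (here 4):
as_you_like_it = counts[:, 0]        # [1, 114, 36, 20]
julius_caesar = counts[:, 2]         # [7, 62, 1, 2]

# Each word is a row vector of length (number of documents):
fool = counts[words.index("fool")]   # [36, 58, 1, 4]

print(as_you_like_it, julius_caesar, fool)
```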
Visualizing document vectors
[Figure 6.4: A spatial visualization of the document vectors for the four Shakespeare plays, showing just the two dimensions corresponding to the words fool (x-axis) and battle (y-axis); for example, Henry V is at [4,13].]
Vectors are the basis of information retrieval
Idea for word meaning: Words can be vectors too!!!

A word can instead be represented as a row vector rather than a column vector, hence with different dimensions, as shown in Fig. 6.5. The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays. Word counts in the same four dimensions are used to form the vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good [114,80,62,89].

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0               7            13
good              114              80              62            89
fool               36              58               1             4
wit                20              15               2             3

Figure 6.5: The term-document matrix for four words in four Shakespeare plays. Each cell contains the number of times the (row) word occurs in the (column) document; each word is represented as a row vector of length four.
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"

For documents, we saw that similar documents had similar vectors, because similar documents tend to have similar words. This same principle applies to words: similar words have similar vectors because they tend to occur in similar documents. The term-document matrix thus lets us represent the meaning of a word by the documents it tends to occur in.

More common: word-word matrix (or "term-context matrix")

Instead of documents, the dimensions are context words: each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The context could be the document, in which case the cell represents the number of times the two words appear in the same document. It is most common, however, to use smaller contexts, generally a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a ±4 word window around the row word.

Two words are similar in meaning if their context vectors are similar.

For example, here is one example each of some words in their windows:

    is traditionally followed by cherry pie, a traditional dessert
    often mixed, such as strawberry rhubarb pie. Apple pie
    computer peripherals and personal digital assistants. These devices usually
    a computer. This includes information available on the internet

If we then take every occurrence of each word (say strawberry) and count the context words around it, we get a word-word co-occurrence matrix. Fig. 6.6 shows a simplified subset of the word-word co-occurrence matrix for these four words computed from the Wikipedia corpus (Davies, 2015).

              aardvark   ...   computer   data   result   pie   sugar   ...
cherry            0      ...        2        8       9    442     25    ...
strawberry        0      ...        0        0       1     60     19    ...
digital           0      ...     1670     1683      85      5      4    ...
information       0      ...     3325     3982     378      5     13    ...

Note in Fig. 6.6 that the two words cherry and strawberry are more similar to each other (both pie and sugar tend to occur in their windows) than they are to other words like digital.
[Figure: a spatial visualization of the word vectors digital = [1683,1670] and information = [3982,3325], plotted in the two dimensions corresponding to the context words data and computer.]
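A minimal sketch (my own toy illustration, not course code) of how such window-based counts are collected; the cooccurrence_counts helper and the tiny corpus are hypothetical:

```python
# Sketch: counting a word-word co-occurrence matrix with a +/-4-word window
# from a toy tokenized corpus.
from collections import defaultdict

def cooccurrence_counts(sentences, window=4):
    """counts[(target, context)] = number of times `context` appears
    within `window` words to the left or right of `target`."""
    counts = defaultdict(int)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(target, tokens[j])] += 1
    return counts

# Toy corpus; a real matrix would be built from something like Wikipedia.
sentences = [
    "is traditionally followed by cherry pie a traditional dessert".split(),
    "often mixed such as strawberry rhubarb pie".split(),
]
counts = cooccurrence_counts(sentences, window=4)
print(counts[("cherry", "pie")], counts[("strawberry", "pie")])
```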
As we will see, most metrics for similarity between vectors are based on the dot product. The dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will have a dot product of 0, representing their strong dissimilarity. The dot product can thus be a useful similarity metric between vectors.

Problem with raw dot-product

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The dot product is higher if a vector is longer, with higher values in many dimensions. Vector length is defined as

    |v| = √( Σ_{i=1..N} v_i² )        (6.8)

Frequent words (of, the, you) have long vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product thus will be higher for frequent words, so it overly favors frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

Alternative: cosine for computing word similarity

The cosine similarity metric between two vectors v and w can thus be computed as:

    cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1..N} v_i w_i / ( √(Σ_{i=1..N} v_i²) √(Σ_{i=1..N} w_i²) )        (6.10)
We modify the dot product to normalize for the vector length by dividing the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors a and b:

    a · b = |a| |b| cos θ
    (a · b) / (|a| |b|) = cos θ        (6.9)

For some applications we pre-normalize each vector, by dividing it by its length, creating a unit vector of length 1. Thus we could compute a unit vector from a by dividing it by |a|. For unit vectors, the dot product is the same as the cosine.

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for vectors that are orthogonal, to -1 for vectors pointing in opposite directions. But since raw frequency values are non-negative, the cosine for these vectors ranges from 0 to 1.
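A small numpy sketch of Eq. 6.10 (the helper function is mine), applied to the raw counts from Fig. 6.6 restricted to the three dimensions pie, data, and computer shown there:

```python
# Sketch: cosine similarity over the pie/data/computer dimensions of Fig. 6.6.
import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# vectors over the dimensions [pie, data, computer]
cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

print(cosine(cherry, information))   # ~0.02: cherry is not like information
print(cosine(digital, information))  # ~0.996: digital is very like information
```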
Cosine as a similarity metric

[Figure 6.7: A (rough) graphical demonstration of cosine similarity, showing the vectors for cherry, digital, and information, with Dimension 2: 'computer' on the vertical axis.]
Cosine for computing word similarity
TF-IDF
But raw frequency is a bad representation

Two common solutions for weighting the counts:

tf-idf weighting
◦ Words like "the" or "it" have very low idf
◦ tf-idf weighting is by far the dominant way of weighting co-occurrence matrices in information retrieval, but it also plays a role in many other aspects of natural language processing

PMI (Pointwise mutual information)
◦ PMI(w1, w2) = log [ p(w1, w2) / ( p(w1) p(w2) ) ]
◦ See if words like "good" appear more often with "great" than we would expect by chance
Term frequency (tf)

    tf_t,d = count(t, d)

or, more commonly, squashing the raw counts with a log:

    tf_t,d = log10( count(t, d) + 1 )
Document frequency (df)

df_t is the number of documents a term t occurs in. (Note this is not collection frequency: the total number of times the word appears in the whole collection, in any document.)

Terms that occur in only a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren't as helpful.

Consider in the collection of Shakespeare's 37 plays the two words Romeo and action. The words have identical collection frequencies (they both occur 113 times in all the plays) but very different document frequencies, since Romeo only occurs in a single play. If our goal is to find documents about the romantic tribulations of Romeo, the word Romeo should be highly weighted, but not action.

"Romeo" is very distinctive for one Shakespeare play:

           Collection Frequency   Document Frequency
Romeo              113                     1
action             113                    31
Inverse document frequency (idf)

We emphasize discriminative words like Romeo via the inverse document frequency or idf term weight (Sparck Jones, 1972). The idf is defined using the fraction N/df_t, where N is the total number of documents in the collection, and df_t is the number of documents the term occurs in. Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (idf) is thus:

    idf_t = log10( N / df_t )        (6.13)

Here are some idf values for words in the Shakespeare corpus, ranging from words which occur in only one play like Romeo, to those that occur in a few like Falstaff, to those which are very common like fool, or so common as to be completely non-discriminative since they occur in all 37 plays, like good or sweet:

Word        df     idf
Romeo        1    1.57
salad        2    1.27
Falstaff     4    0.967
forest      12    0.489
battle      21    0.246
wit         34    0.037
fool        36    0.012
good        37    0
sweet       37    0
What is a document?
Final tf-idf weighted value for a word

The tf-idf weighted value w_t,d for word t in document d thus combines term frequency tf_t,d (defined either by Eq. 6.11 or by Eq. 6.12) with idf from Eq. 6.13:

    w_t,d = tf_t,d × idf_t        (6.14)

Raw counts:

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0               7            13
good              114              80              62            89
fool               36              58               1             4
wit                20              15               2             3

tf-idf (using the log tf of Eq. 6.12):

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           0.074              0             0.22          0.28
good             0                  0             0             0
fool             0.019          0.021             0.0036        0.0083
wit              0.049          0.044             0.018         0.022

Figure 6.9: A tf-idf weighted term-document matrix for four words in four Shakespeare plays. Note that the tf-idf values for the dimension corresponding to the word good have now all become 0; since this word appears in every document, the tf-idf algorithm leads it to be ignored. Similarly, the word fool, which appears in 36 out of the 37 plays, has a much lower weight.
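As a sanity check, here is a short numpy sketch (mine, not the course's code) that recomputes Fig. 6.9 from the raw counts, using the log tf of Eq. 6.12 and the df values from the idf table; small differences from the figure are just rounding:

```python
# Sketch: recomputing the tf-idf values of Fig. 6.9 from the raw counts,
# with tf = log10(count + 1) and idf = log10(N / df) over N = 37 plays.
import numpy as np

words = ["battle", "good", "fool", "wit"]
counts = np.array([           # As You Like It, Twelfth Night, Julius Caesar, Henry V
    [  1,   0,   7,  13],
    [114,  80,  62,  89],
    [ 36,  58,   1,   4],
    [ 20,  15,   2,   3],
], dtype=float)

N = 37.0                                        # total number of documents (plays)
df = np.array([21, 37, 36, 34], dtype=float)    # document frequencies of the 4 words

tf = np.log10(counts + 1)
idf = np.log10(N / df)
tfidf = tf * idf[:, None]                       # broadcast each word's idf over the 4 documents

for word, row in zip(words, np.round(tfidf, 3)):
    print(word, row)
# battle [0.074 0.    0.222 0.282], good is all zeros, etc.
```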
PPMI
Pointwise Mutual Information

    PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) P(word2) ) ]
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to +∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ Unreliable without enormous corpora
◦ Imagine w1 and w2 whose probability is each 10⁻⁶
◦ Hard to be sure p(w1, w2) is significantly different than 10⁻¹²
◦ Plus it's not clear people are good at "unrelatedness"
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:

    PPMI(word1, word2) = max( log2 [ P(word1, word2) / ( P(word1) P(word2) ) ], 0 )
The co-occurrence counts can be turned into a PPMI matrix where ppmi_ij gives the PPMI value of word w_i with context c_j. For example, from the counts in Fig. 6.6:

    pmi(information, data) = log2( .3399 / (.6575 × .4842) ) = .0944
Resulting PPMI matrix (negatives replaced by 0)

              computer   data   result   pie    sugar
cherry            0        0       0     4.38   3.30
strawberry        0        0       0     4.10   5.51
digital         0.18     0.01      0     0      0
information     0.02     0.09    0.28    0      0

Figure 6.12: The PPMI matrix showing the association between words and context words, computed from the counts in Fig. 6.6.
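A sketch (mine) of the PPMI computation applied to the count subset of Fig. 6.6, treating that subset as the whole matrix the way the pmi(information, data) example above does; it reproduces the values of Fig. 6.12 up to rounding:

```python
# Sketch: computing a PPMI matrix from a word-by-context count matrix.
import numpy as np

def ppmi(counts, eps=1e-12):
    total = counts.sum()
    p_wc = counts / total                       # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)       # row marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)       # column marginals P(c)
    pmi = np.log2((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0)                   # replace negative PMI by 0

counts = np.array([               # computer, data, result, pie, sugar
    [   2,    8,   9, 442, 25],   # cherry
    [   0,    0,   1,  60, 19],   # strawberry
    [1670, 1683,  85,   5,  4],   # digital
    [3325, 3982, 378,   5, 13],   # information
], dtype=float)

print(np.round(ppmi(counts), 2))
```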
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities
◦ Use add-one smoothing (which has a similar effect)
Weighting PMI: Giving rare context words slightly higher probability

PMI has the problem of being biased toward infrequent events; very rare words tend to have very high PMI values. One way to reduce this bias toward low frequency events is to slightly change the computation for P(c), using a different function P_α(c) that raises contexts to the power of α (Levy et al., 2015). Raise the context probabilities to α = 0.75:

    PPMI_α(w, c) = max( log2 [ P(w, c) / ( P(w) P_α(c) ) ], 0 )

    P_α(c) = count(c)^α / Σ_c' count(c')^α

Levy et al. (2015) found that a setting of α = 0.75 improved performance of embeddings on a wide range of tasks (drawing on a similar weighting used for skip-grams (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014)). This works because raising the probability to the power α = 0.75 increases the probability assigned to rare contexts, and hence lowers their PMI: P_α(c) > P(c) when c is rare.

Consider two events, P(a) = .99 and P(b) = .01:

    P_α(a) = .99^.75 / (.99^.75 + .01^.75) = .97
    P_α(b) = .01^.75 / (.99^.75 + .01^.75) = .03

Another possible solution is Laplace smoothing: before computing PMI, a small constant is added to each of the counts.
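A two-line numpy check (mine) of the α = 0.75 example above:

```python
# Sketch: the alpha-weighted context distribution P_alpha(c) for two events
# with P(a) = .99 and P(b) = .01.
import numpy as np

p = np.array([0.99, 0.01])
alpha = 0.75
p_alpha = p**alpha / np.sum(p**alpha)
print(np.round(p_alpha, 2))   # [0.97 0.03]: the rare event gets more mass
```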
Word2vec
Sparse versus dense vectors
[Figure: the word2vec setup, with a target-word embedding matrix W (rows aardvark ... apricot ... zebra, words 1 .. |V|) and a context embedding matrix C (words |V|+1 .. 2|V|), together with example (target, context) training pairs for apricot: positive pairs such as (apricot, jam) and (apricot, a), and sampled negative pairs such as (apricot, where), (apricot, dear), (apricot, coaxial), (apricot, if).]
Word2vec: how to learn vectors
Given the set of positive and negative training instances,
and an initial set of embedding vectors
The goal of learning is to adjust those word vectors such
that we:
◦ Maximize the similarity of the target word, context word pairs (w, c_pos) drawn from the positive data
◦ Minimize the similarity of the (w, c_neg) pairs drawn from the negative data.
Loss function for one w with c_pos, c_neg1 ... c_negk

If we consider one word/context pair (w, c_pos) with its k noise words c_neg1 ... c_negk, we can express these two goals as the following loss function L to be minimized (hence the minus sign); the first term expresses that we want the classifier to assign the real context word c_pos a high probability of being a neighbor, and the second term expresses that we want to assign each of the noise words c_negi a high probability of being a non-neighbor, all multiplied because we assume independence:

    L_CE = −log [ P(+|w, c_pos) ∏_{i=1..k} P(−|w, c_negi) ]
         = −[ log P(+|w, c_pos) + Σ_{i=1..k} log P(−|w, c_negi) ]
         = −[ log P(+|w, c_pos) + Σ_{i=1..k} log ( 1 − P(+|w, c_negi) ) ]
         = −[ log σ(c_pos · w) + Σ_{i=1..k} log σ(−c_negi · w) ]        (6.34)
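A numpy sketch (my own illustration, not the reference implementation) of this loss for a single (w, c_pos, c_neg1..k) example, with random vectors standing in for the embeddings:

```python
# Sketch: the skip-gram negative-sampling loss of Eq. 6.34 for one example.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    """L = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]"""
    loss = -np.log(sigmoid(np.dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-np.dot(c_neg, w)))
    return loss

rng = np.random.default_rng(0)
d, k = 50, 2                                # embedding size, number of negatives
w = rng.normal(scale=0.1, size=d)           # e.g. the embedding of "apricot"
c_pos = rng.normal(scale=0.1, size=d)       # e.g. the embedding of "jam"
c_negs = rng.normal(scale=0.1, size=(k, d)) # the k sampled noise-word embeddings
print(sgns_loss(w, c_pos, c_negs))
```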
Learning the classifier

How to learn?
◦ Stochastic gradient descent!

[Figure: the intuition of one step of learning, with k = 2 negative examples: move the target embedding w (apricot) and the positive context embedding c_pos (jam) closer, increasing c_pos · w, while moving w apart from the negative context embeddings c_neg1 (aardvark) and c_neg2 (Tolstoy), decreasing c_neg · w.]
Reminder: gradient descent

At each step:
• Direction: we move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient, d/dw L(f(x; w), y), weighted by a learning rate η
• Higher learning rate means we move w faster

    w^(t+1) = w^t − η (d/dw) L( f(x; w), y )

The derivatives of the loss function

Recall the loss to be minimized:

    L_CE = −[ log σ(c_pos · w) + Σ_{i=1..k} log σ(−c_negi · w) ]        (6.34)

Its derivatives (the proof is left as an exercise at the end of the chapter) are:

    ∂L_CE / ∂c_pos  = [ σ(c_pos · w) − 1 ] w
    ∂L_CE / ∂c_negi = [ σ(c_negi · w) ] w
    ∂L_CE / ∂w      = [ σ(c_pos · w) − 1 ] c_pos + Σ_{i=1..k} [ σ(c_negi · w) ] c_negi

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words. We minimize this loss function using stochastic gradient descent.
Update equation in SGD

Just as in logistic regression, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient descent to move W and C so as to maximize the objective in Eq. 6.34, updating each embedding in the direction opposite its gradient.
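A sketch (mine) of one such SGD step for a single training example, applying the three gradients above with learning rate η; the function and variable names are hypothetical:

```python
# Sketch: one SGD update for a single (w, c_pos, c_neg_1..k) example.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w, c_pos, c_negs, eta=0.1):
    """Move w, c_pos, and each c_neg against the gradients:
    dL/dc_pos = [sigma(c_pos.w) - 1] w
    dL/dc_neg = [sigma(c_neg.w)] w
    dL/dw     = [sigma(c_pos.w) - 1] c_pos + sum_i [sigma(c_neg_i.w)] c_neg_i
    """
    s_pos = sigmoid(c_pos @ w)
    s_negs = [sigmoid(c_neg @ w) for c_neg in c_negs]

    grad_w = (s_pos - 1.0) * c_pos + sum(s * c_neg for s, c_neg in zip(s_negs, c_negs))
    new_c_pos = c_pos - eta * (s_pos - 1.0) * w
    new_c_negs = [c_neg - eta * s * w for s, c_neg in zip(s_negs, c_negs)]
    new_w = w - eta * grad_w
    return new_w, new_c_pos, new_c_negs

rng = np.random.default_rng(1)
w, c_pos = rng.normal(size=8), rng.normal(size=8)
c_negs = [rng.normal(size=8) for _ in range(2)]
w, c_pos, c_negs = sgd_step(w, c_pos, c_negs)
```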
Two sets of embeddings
SGNS learns two sets of embeddings
Target embeddings matrix W
Context embedding matrix C
It's common to just add them together, representing word i as the vector w_i + c_i
Summary: How to learn word2vec (skip-gram)
embeddings
Start with V random d-dimensional vectors as initial
embeddings
Train a classifier based on embedding similarity
◦Take a corpus and take pairs of words that co-occur as positive
examples
◦Take pairs of words that don't co-occur as negative examples
◦Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦Throw away the classifier code and keep the embeddings.
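In practice one rarely writes this training loop by hand; a library such as gensim can train skip-gram with negative sampling directly. A minimal sketch, assuming gensim >= 4 (parameter names differ in older versions), on a made-up toy corpus:

```python
# Sketch: training skip-gram with negative sampling using gensim.
from gensim.models import Word2Vec

sentences = [
    ["apricot", "jam", "on", "toast"],
    ["a", "tablespoon", "of", "apricot", "preserves"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # d, the embedding dimensionality
    window=5,          # context window size
    sg=1,              # 1 = skip-gram (0 = CBOW)
    negative=5,        # k negative samples per positive example
    min_count=1,
    epochs=50,
)

vec = model.wv["apricot"]                 # the learned embedding
print(model.wv.most_similar("apricot"))   # nearest neighbors by cosine
```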
Word2vec: Learning the embeddings
Properties of Embeddings
The kinds of neighbors depend on window size
Small windows (C= +/- 2) : nearest words are syntactically
similar words in same taxonomy
◦Hogwarts nearest neighbors are other fictional schools
◦Sunnydale, Evernight, Blandings
Large windows (C= +/- 5) : nearest words are related
words in same semantic field
◦Hogwarts nearest neighbors are Harry Potter world:
◦Dumbledore, half-blood, Malfoy
Analogical relations

The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973):

In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form a is to b as a* is to what?. In such problems, a system is given a problem like apple:tree::grape:?, i.e., "apple is to tree as grape is to _____", and must fill in the word vine. In the parallelogram model, illustrated in Fig. 6.15, the vector from the word apple to the word tree (= tree − apple) is added to the vector for grape; the nearest word to that point is returned.

[Figure 6.15: the parallelogram model: apple is to tree as grape is to vine.]

Analogical relations via parallelogram

The parallelogram method received more modern attention because of word2vec and GloVe vectors (Mikolov et al. 2013b, Levy and Goldberg 2014, Pennington et al. 2014). It can solve analogies with both sparse and dense embeddings (Turney and Littman 2005, Mikolov et al. 2013b):
◦ king − man + woman is close to queen
◦ Paris − France + Italy is close to Rome

The embedding model thus seems to be extracting representations of relations like MALE-FEMALE, or CAPITAL-CITY-OF, or COMPARATIVE/SUPERLATIVE, as shown in Fig. 6.16 from GloVe.

For an analogy problem a:a*::b:b*, where the algorithm is given a, a*, and b and must find b*, the parallelogram method is thus:

    b̂* = argmin_x distance(x, a* − a + b)
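A toy numpy sketch (mine) of the parallelogram method with cosine similarity standing in for (negated) distance; the vocabulary and vectors are fabricated so the analogy works out, and a real system would use trained embeddings:

```python
# Sketch: b* = argmin_x distance(x, a* - a + b), via maximum cosine similarity.
import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def solve_analogy(a, a_star, b, vocab):
    """Return the word whose vector is closest to a* - a + b,
    excluding the three input words (as is standard practice)."""
    target = vocab[a_star] - vocab[a] + vocab[b]
    candidates = [x for x in vocab if x not in (a, a_star, b)]
    return max(candidates, key=lambda x: cosine(vocab[x], target))

rng = np.random.default_rng(2)
vocab = {w: rng.normal(size=10) for w in ["man", "woman", "king", "queen", "apple"]}
# Fake the structure king - man + woman ~ queen so the toy example works:
vocab["queen"] = vocab["king"] - vocab["man"] + vocab["woman"]

print(solve_analogy("man", "woman", "king", vocab))   # -> "queen"
```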
Structure in GloVe Embedding space
Caveats with the parallelogram method
It only seems to work for frequent words, small
distances and certain relations (relating countries to
capitals, or parts of speech), but not others. (Linzen
2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)
Embeddings can also be used to study how word meanings change over time:
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL.
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer
programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.