4. Machine Learning: Word Embedding - 1
Tien-Lam Pham
Contents
Word Meaning
Word similarity
tf-idf, PMI
Word Embedding
Skip-gram models
What do words mean?

N-gram or text classification methods we've seen so far:
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!

Introductory logic classes: lemmas and senses
◦ The meaning of "dog" is DOG; cat is CAT
◦ ∀x DOG(x) ⟶ MAMMAL(x)

Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
That seems hardly better!

Example: the lemma mouse (N) has two senses:
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
(Modified from the online thesaurus WordNet)
Relation: Semantic Field

Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.

hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
restaurants: waiter, menu, plate, food, chef
houses: door, roof, kitchen, family, bed

Relation: Antonymy
Senses that are opposites with respect to only one feature of meaning; otherwise, they are very similar!
dark/light  short/long  fast/slow  rise/fall
hot/cold  up/down  in/out

More formally, antonyms can
◦ define a binary opposition or be at opposite ends of a scale
  ◦ long/short, fast/slow
◦ be reversives:
  ◦ rise/fall, up/down
Connotation (sentiment)

Word similarity

[Figure 6.1: A two-dimensional (t-SNE) projection of embeddings for some words and phrases, showing that words with similar meanings are nearby in space. The original 60-dimensional embeddings were trained for sentiment analysis.]
Computational Models for Word Meaning

tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of nearby words

Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings
Term-Document Matrix

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0                7           13
good              114              80               62           89
fool               36              58                1            4
wit                20              15                2            3
Figure 6.3  The term-document matrix for four words in four Shakespeare plays. Each document is represented as a column vector of length four.

To review some basic linear algebra: a vector is, at heart, just a list or array of numbers. So As You Like It is represented as the list [1,114,36,20] (the first column vector in Fig. 6.3) and Julius Caesar is represented as the list [7,62,1,2] (the third column vector). A vector space is a collection of vectors, characterized by their dimension. In the example in Fig. 6.3, the document vectors are of dimension 4, just so they fit on the page; in real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size.

The ordering of the numbers in a vector space indicates different meaningful dimensions on which documents vary. Thus the first dimension for both these vectors corresponds to the number of times the word battle occurs, and we can compare each dimension, noting for example that the vectors for As You Like It and Twelfth Night have similar values (1 and 0, respectively) for the first dimension.

The documents in Fig. 6.3 are thus points in 4-dimensional space. Since 4-dimensional spaces are hard to visualize, Fig. 6.4 shows a visualization in two dimensions; we've arbitrarily chosen the dimensions corresponding to the words battle and fool.

[Figure 6.4: A spatial visualization of the document vectors on the fool and battle dimensions: As You Like It [36,1], Twelfth Night [58,0], Julius Caesar [1,7], Henry V [4,13].]

Idea for word meaning: Words can be vectors too!!!

Word Vector?

We can also represent each word by a vector, now with documents as the dimensions, as shown in Fig. 6.5. The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays. Word counts in the same four dimensions are used to form the vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good, [114,80,62,89].
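As a quick sketch (illustrative code, not part of the original slides), the term-document counts above can be stored with words as rows and plays as columns, so that a document vector is a column and a word vector is a row:

```python
# Counts from the Shakespeare term-document matrix (Fig. 6.3 / 6.5).
WORDS = ["battle", "good", "fool", "wit"]
PLAYS = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
counts = {
    "battle": [1, 0, 7, 13],
    "good":   [114, 80, 62, 89],
    "fool":   [36, 58, 1, 4],
    "wit":    [20, 15, 2, 3],
}

def document_vector(play):
    """A column of the matrix: one count per word, in a fixed word order."""
    j = PLAYS.index(play)
    return [counts[w][j] for w in WORDS]

def word_vector(word):
    """A row of the matrix: one count per play."""
    return counts[word]

print(document_vector("As You Like It"))  # [1, 114, 36, 20]
print(word_vector("fool"))                # [36, 58, 1, 4]
```

The function names (`document_vector`, `word_vector`) are my own; the point is only that the same matrix yields document vectors by column and word vectors by row.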
             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0                7           13
good              114              80               62           89
fool               36              58                1            4
wit                20              15                2            3
Figure 6.5  The term-document matrix for four words in four Shakespeare plays. The red boxes show that each word is represented as a row vector of length four.

battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"

For documents, we saw that similar documents had similar vectors, because similar documents tend to have similar words. This same principle applies to words: similar words have similar vectors because they tend to occur in similar documents. The term-document matrix thus lets us represent the meaning of a word by the documents it tends to occur in.
6.3.3 Words as vectors: word dimensions
An alternative to using the term-document matrix to represent words as vectors of document counts is to use the term-term matrix, also called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality |V| × |V| and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. The context could be the document, in which case the cell represents the number of times the two words appear in the same document.

More common: word-word matrix (or "term-context matrix")

It is most common, however, to use smaller contexts, generally a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a ±4 word window around the row word. For example, here is one example each of some of the words in their windows:

    is traditionally followed by cherry pie, a traditional dessert
    often mixed, such as strawberry rhubarb pie. Apple pie
    computer peripherals and personal digital assistants. These devices usually
    a computer. This includes information available on the internet

Two words are similar in meaning if their context vectors are similar.
If we then take every occurrence of each word (say strawberry) and count the context words around it, we get a word-word co-occurrence matrix. Fig. 6.6 shows a simplified subset of the word-word co-occurrence matrix for these four words computed from the Wikipedia corpus (Davies, 2015).

              aardvark  ...  computer   data   result   pie   sugar  ...
cherry            0     ...      2        8       9     442     25   ...
strawberry        0     ...      0        0       1      60     19   ...
digital           0     ...   1670     1683      85       5      4   ...
information       0     ...   3325     3982     378       5     13   ...
Figure 6.6  Co-occurrence vectors for four words in the Wikipedia corpus, showing six of the dimensions (hand-picked for pedagogical purposes). The vector for digital is outlined in red. Note that a real vector would have vastly more dimensions and thus be much sparser.

Note in Fig. 6.6 that the two words cherry and strawberry are more similar to each other (both pie and sugar tend to occur in their window) than they are to other words like digital; conversely, digital and information are more similar to each other than, say, to strawberry.
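Window-based counting is easy to sketch in code (a toy corpus and my own function name, not the Wikipedia corpus of Fig. 6.6):

```python
from collections import Counter, defaultdict

def cooccurrence_counts(tokens, window=4):
    """For each target word, count how often each context word appears
    within +/-window positions of it (the word-word matrix of the slides)."""
    matrix = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                matrix[target][tokens[j]] += 1
    return matrix

# Tiny illustrative corpus:
tokens = "often mixed such as strawberry rhubarb pie".split()
m = cooccurrence_counts(tokens, window=4)
print(m["strawberry"]["pie"])   # 1
print(m["pie"]["strawberry"])   # 1  (symmetric windows give a symmetric matrix)
```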
Word Vector?

To measure similarity between two target words v and w, we need a metric that takes two vectors (of the same dimensionality, either both with words as dimensions, hence of length |V|, or both with documents as dimensions, of length |D|) and gives a measure of their similarity. By far the most common similarity metric is the cosine of the angle between the vectors.
The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

dot product(v, w) = v · w = Σ_{i=1}^N v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_N w_N

As we will see, most metrics for similarity between vectors are based on the dot product. The dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will have a dot product of 0, representing their strong dissimilarity.

This raw dot product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as

|v| = sqrt( Σ_{i=1}^N v_i² )

The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product thus will be higher for frequent words. But this is a problem; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.

Alternative: cosine. We modify the dot product to normalize for the vector length by dividing the dot product by the lengths of each of the two vectors. From a · b = |a||b| cos θ we get cos θ = (a · b) / (|a||b|). The cosine similarity between two vectors v and w thus can be computed as:

cosine(v, w) = (v · w) / (|v||w|) = Σ_{i=1}^N v_i w_i / ( sqrt(Σ_{i=1}^N v_i²) · sqrt(Σ_{i=1}^N w_i²) )

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for vectors that are orthogonal, to -1 for vectors pointing in opposite directions. But since raw frequency values are non-negative, the cosine for these vectors ranges from 0 to 1.
Cosine examples

Let's see how the cosine computes which of the words cherry or digital is closer in meaning to information, just using raw counts from the following shortened table:

              pie   data  computer
cherry        442      8         2
digital         5   1683      1670
information     5   3982      3325

cos(cherry, information)
  = (442·5 + 8·3982 + 2·3325) / ( sqrt(442² + 8² + 2²) · sqrt(5² + 3982² + 3325²) ) = .017

cos(digital, information)
  = (5·5 + 1683·3982 + 1670·3325) / ( sqrt(5² + 1683² + 1670²) · sqrt(5² + 3982² + 3325²) ) = .996

The model decides that information is way closer to digital than it is to cherry, a result that seems sensible. Fig. 6.7 shows a visualization.

[Figure 6.7: A spatial visualization of the word vectors for digital [1683,1670] and information [3982,3325] on the data and computer dimensions.]
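The two cosines can be checked directly (a sketch; note the first value is ≈0.0178, which the slides truncate to .017):

```python
import math

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    return dot / (math.sqrt(sum(vi * vi for vi in v)) *
                  math.sqrt(sum(wi * wi for wi in w)))

# Raw counts from the shortened table, dimensions (pie, data, computer):
cherry      = [442, 8, 2]
digital     = [5, 1683, 1670]
information = [5, 3982, 3325]

print(cosine(cherry, information))   # ~0.0178
print(cosine(digital, information))  # ~0.996
```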
Word Vector: tf-idf

Two common solutions for word weighting: tf-idf (for term-document matrices) and PMI (pointwise mutual information, for word-word matrices).

Term frequency (tf). Instead of using the raw count, we squash it a bit:

tf_{t,d} = log10( count(t, d) + 1 )

With log weighting, a term which occurs 0 times in a document has tf = log10(1) = 0; 10 times, tf = log10(11) = 1.04; 100 times, tf = log10(101) = 2.004; 1000 times, tf = 3.00044; and so on.

Document frequency (df). df_t is the number of documents the term t occurs in. This is not the same as the collection frequency of a term, which is the total number of times the word appears in the whole collection in any document. Consider in the collection of Shakespeare's 37 plays the two words Romeo and action:

          Collection Frequency   Document Frequency
Romeo            113                     1
action           113                    31

The two words have identical collection frequencies (they both occur 113 times in all the plays) but very different document frequencies, since Romeo only occurs in a single play. If our goal is to find documents about the romantic tribulations of Romeo, the word Romeo should be highly weighted, but not action.

Inverse document frequency (idf). The idf is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren't as helpful. Words like "the" or "it" have very low idf. With N the total number of documents in the collection:

idf_t = log10( N / df_t )

Here are some idf values for some words in the Shakespeare corpus, ranging from extremely informative words which occur in only one play like Romeo, to those that occur in a few like salad or Falstaff, to those which are very common like fool, or so frequent as to be completely non-discriminative since they occur in all 37 plays like good or sweet. (sweet was one of Shakespeare's favorite adjectives, a fact probably related to the increased use of sugar in European recipes around the turn of the 16th century (Jurafsky, 2014, p. 175).)

Word       df    idf
Romeo       1    1.57
salad       2    1.27
Falstaff    4    0.967
forest     12    0.489
battle     21    0.246
wit        34    0.037
fool       36    0.012
good       37    0
sweet      37    0

tf-idf. The tf-idf weighted value w_{t,d} for word t in document d thus combines term frequency with idf:

w_{t,d} = tf_{t,d} × idf_t

The tf-idf weighting is by far the dominant way of weighting co-occurrence matrices in information retrieval, but also plays a role in many other aspects of natural language processing. The resulting tf-idf term-document matrix replaces the raw counts of the matrix in Fig. 6.2.
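These formulas are easy to check numerically (a sketch with my own function names, using the slides' log10 weighting and the 37-play Shakespeare collection):

```python
import math

def tf(count):
    """Log-squashed term frequency: log10(count + 1)."""
    return math.log10(count + 1)

def idf(N, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(N / df)

def tf_idf(count, N, df):
    return tf(count) * idf(N, df)

N = 37  # Shakespeare's 37 plays
print(round(tf(10), 2))      # 1.04
print(round(idf(N, 1), 2))   # 1.57  (Romeo: occurs in only one play)
print(round(idf(N, 21), 3))  # 0.246 (battle)
print(idf(N, 37))            # 0.0   (good, sweet: occur in every play)
```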
Word Vector: Pointwise Mutual Information

Pointwise Mutual Information:

PMI(word1, word2) = log2 [ P(word1, word2) / ( P(word1) P(word2) ) ]
Word Vector: Pointwise Mutual Information
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to + ∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ Unreliable without enormous corpora
◦ Imagine w1 and w2 whose probability is each 10^-6
◦ Hard to be sure p(w1,w2) is significantly different than 10^-12
◦ Plus it’s not clear people are good at “unrelatedness”
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:
PPMI(word1, word2) = max( log2 [ P(word1, word2) / ( P(word1) P(word2) ) ], 0 )
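A sketch of turning a co-occurrence count matrix into PPMI values (the dictionary layout and the toy counts are my own, not from the slides; probabilities are estimated from the counts):

```python
import math

def ppmi_matrix(counts):
    """counts[w][c] = co-occurrence count.
    Returns ppmi[w][c] = max(log2(P(w,c) / (P(w) P(c))), 0)."""
    total = sum(sum(row.values()) for row in counts.values())
    w_sum = {w: sum(row.values()) for w, row in counts.items()}
    c_sum = {}
    for row in counts.values():
        for c, n in row.items():
            c_sum[c] = c_sum.get(c, 0) + n
    ppmi = {}
    for w, row in counts.items():
        ppmi[w] = {}
        for c, n in row.items():
            if n == 0:
                ppmi[w][c] = 0.0
                continue
            pmi = math.log2((n / total) /
                            ((w_sum[w] / total) * (c_sum[c] / total)))
            ppmi[w][c] = max(pmi, 0.0)  # clip negative PMI to 0
    return ppmi

# Toy counts: cherry co-occurs with pie far more than chance,
# digital far less, so its PPMI with pie is clipped to 0.
toy = {"cherry": {"pie": 442, "data": 8}, "digital": {"pie": 5, "data": 1683}}
p = ppmi_matrix(toy)
print(p["cherry"]["pie"] > 0)    # True
print(p["digital"]["pie"])       # 0.0
```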
Computing Positive Pointwise Mutual Information

A co-occurrence matrix of words w_i and contexts c_j can be turned into a PPMI matrix where ppmi_ij gives the PPMI value of word w_i with context c_j. Fig. 6.11 shows the joint probabilities computed from the counts in Fig. 6.10.
Sparse versus dense vectors

Word to Vector (word2vec)

[Table: skip-gram training pairs for the target word apricot; a positive pair such as (apricot, jam) next to sampled negative pairs such as (apricot, where), (apricot, dear), (apricot, a), (apricot, coaxial), (apricot, if)]
For training a binary classifier we also need negative examples. In fact skip-gram uses more negative examples than positive examples (with the ratio between them set by a parameter k). So for each of these (t, c) training instances we'll create k negative samples, each consisting of the target t plus a 'noise word'.

Skip-gram Training

We want to maximize the similarity of the target with the actual context words and minimize the similarity of the target with the k negative sampled non-neighbor words. In the loss below, the first term expresses that we want the classifier to assign the actual context word cpos a high probability of being a neighbor, and the second term expresses that we want to assign each of the noise words cnegi a high probability of being a non-neighbor, all multiplied because we assume independence:
" k
#
Y
LCE = log P(+|w, cpos ) P( |w, cnegi )
i=1
" k
#
X
= log P(+|w, cpos ) + log P( |w, cnegi )
i=1
" k
#
X
= log P(+|w, cpos ) + log 1 P(+|w, cnegi )
i=1
" k
#
X
= log s (cpos · w) + log s ( cnegi · w)
i=1
t is, we want to maximize the dot product of the word with the actual c
ds, and minimize the dot products of the word with the k negative sample
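The final form of L_CE is straightforward to compute (a sketch; the 2-d toy embeddings are invented for illustration, not trained values):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def skipgram_loss(w, c_pos, c_negs):
    """L_CE = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]"""
    loss = -math.log(sigmoid(dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= math.log(sigmoid(-dot(c_neg, w)))
    return loss

w, c_pos = [1.0, 0.5], [0.8, 0.4]
c_negs = [[-0.5, 0.1], [0.2, -0.9]]
# The loss shrinks as c_pos . w grows and the c_neg . w values shrink:
print(skipgram_loss(w, c_pos, c_negs))
```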
Skip-gram Training

The derivatives of the loss function (left as an exercise at the end of the chapter):

∂L_CE/∂cpos = [ σ(cpos · w) − 1 ] w
∂L_CE/∂cneg = [ σ(cneg · w) ] w
∂L_CE/∂w = [ σ(cpos · w) − 1 ] cpos + Σ_{i=1}^k [ σ(cnegi · w) ] cnegi

We minimize this loss function using stochastic gradient descent. The update equations going from time step t to t+1 are thus:

cpos^{t+1} = cpos^t − η [ σ(cpos^t · w^t) − 1 ] w^t
cneg^{t+1} = cneg^t − η [ σ(cneg^t · w^t) ] w^t
w^{t+1} = w^t − η ( [ σ(cpos^t · w^t) − 1 ] cpos^t + Σ_{i=1}^k [ σ(cnegi^t · w^t) ] cnegi^t )

[Figure: the intuition of one step of learning, moving the target embedding (apricot) and the positive context (jam) closer, increasing cpos · w, while pushing the noise words away.]
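One gradient step can be sketched directly from these derivatives (toy 2-d values invented for illustration; `eta` is the learning rate):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_step(w, c_pos, c_negs, eta=0.1):
    """One SGD step on the negative-sampling loss; all gradients use the
    old parameter values, then everything is updated at once."""
    g_pos = sigmoid(dot(c_pos, w)) - 1.0  # shared factor for c_pos and w
    new_c_pos = [c - eta * g_pos * wi for c, wi in zip(c_pos, w)]
    grad_w = [g_pos * c for c in c_pos]
    new_c_negs = []
    for c_neg in c_negs:
        g_neg = sigmoid(dot(c_neg, w))
        new_c_negs.append([c - eta * g_neg * wi for c, wi in zip(c_neg, w)])
        grad_w = [gw + g_neg * c for gw, c in zip(grad_w, c_neg)]
    new_w = [wi - eta * g for wi, g in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_negs

w, c_pos, c_negs = [1.0, 0.5], [0.8, 0.4], [[-0.5, 0.1], [0.2, -0.9]]
w2, c_pos2, c_negs2 = sgd_step(w, c_pos, c_negs)
# After one step the positive pair moves closer (larger dot product):
print(dot(c_pos2, w2) > dot(c_pos, w))  # True
```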
Skip-gram Training

The kinds of neighbors depend on window size.

Small windows (C = ±2): nearest words are syntactically similar words in the same taxonomy
◦ Hogwarts' nearest neighbors are other fictional schools:
◦ Sunnydale, Evernight, Blandings

Large windows (C = ±5): nearest words are related words in the same semantic field
◦ Hogwarts' nearest neighbors are from the Harry Potter world:
◦ Dumbledore, half-blood, Malfoy