
Natural Language Processing

Machine learning for Word meaning

Tien-Lam Pham
Contents

Word Meaning
Word similarity
Tf-idf, PMI
Word Embedding
Skip gram models
What do words mean?

N-gram or text classification methods we've seen so far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
Introductory logic classes:
◦ The meaning of "dog" is DOG; cat is CAT
◦ ∀x DOG(x) ⟶ MAMMAL(x)
Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
That seems hardly better!
Word Meaning

Lemmas and senses
lemma: mouse (N)
sense 1. any of numerous small rodents...
sense 2. a hand-operated device that controls a cursor...
(Modified from the online thesaurus WordNet)

A sense or "concept" is the meaning component of a word

Lemmas can be polysemous (have multiple senses)
Word Meaning

Relations between senses: Synonymy


Synonyms have the same meaning in some or all
contexts.
◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H2O
Word Relation

Relation: Similarity
Words with similar meanings. Not synonyms, but sharing some element of meaning
◦ car, bicycle
◦ cow, horse

Ask humans how similar 2 words are:

word1     word2        similarity
vanish    disappear    9.8
behave    obey         7.3
belief    impression   5.95
muscle    bone         3.65
modest    flexible     0.98
hole      agreement    0.3

SimLex-999 dataset (Hill et al., 2015)


Word Relation

Relation: Word relatedness


Also called "word association"
Words can be related in any way, perhaps via a semantic
frame or field

◦ coffee, tea: similar


◦ coffee, cup: related, not similar
Semantic Field

Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.

hospitals
surgeon, scalpel, nurse, anaesthetic, hospital
restaurants
waiter, menu, plate, food, chef
houses
door, roof, kitchen, family, bed
Relation: Antonymy
Senses that are opposites with respect to only one
feature of meaning
Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
◦ long/short, fast/slow
◦ Be reversives:
◦ rise/fall, up/down
Word similarity

Connotation (sentiment)

• Words have affective meanings


• Positive connotations (happy)
• Negative connotations (sad)
• Connotations can be subtle:
• Positive connotation: copy, replica, reproduction
• Negative connotation: fake, knockoff, forgery
• Evaluation (sentiment!)
• Positive evaluation (great, love)
• Negative evaluation (terrible, hate)
Word Meaning
So far
Concepts or word senses
◦ Have a complex many-to-many association with words (homonymy,
multiple senses)
Have relations with each other
◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation
Computational models of word meaning
Computational Models for Word meaning

Can we build a theory of how to represent word
meaning, that accounts for at least some of the
desiderata?

Ludwig Wittgenstein, PI #43:
"The meaning of a word is its use in the language"

Let's define words by their usages
One way to define "usage":
words are defined by their environments (the words around them)

Zellig Harris (1954):
If A and B have almost identical environments we say that they
are synonyms.

We'll introduce vector semantics
◦ The standard model in language processing!
◦ Handles many of our goals!
Computational Models for Word meaning

Idea 1: Defining meaning by linguistic distribution

Let's define the meaning of a word by its


distribution in language use, meaning its
neighboring words or grammatical environments.
Computational Models for Word meaning

Idea 1: Defining meaning by linguistic distribution
Idea 2: Meaning as a point in multidimensional space

Idea 2: Meaning as a point in space (Osgood et al. 1957)
3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted

              Word         Score    Word        Score
Valence       love         1.000    toxic       0.008
              happy        1.000    nightmare   0.005
Arousal       elated       0.960    mellow      0.069
              frenzy       0.965    napping     0.046
Dominance     powerful     0.991    weak        0.045
              leadership   0.983    empty       0.081

NRC VAD Lexicon (Mohammad 2018)

Hence the connotation of a word is a vector in 3-space


Computational Models for Word meaning

Defining meaning as a point in space based on distribution


Each word = a vector (not just "good" or "w45")
Similar words are "nearby in semantic space"
We build this space automatically by seeing which words are nearby in text

[Figure 6.1: A two-dimensional (t-SNE) projection of embeddings for some words
and phrases, showing that words with similar meanings are nearby in space.]
Computational Models for Word meaning

We define meaning of a word as a vector


Called an "embedding" because it's embedded into a
space (see textbook)
The standard way to represent meaning in NLP
Every modern NLP algorithm uses embeddings as
the representation of word meaning
Fine-grained model of meaning for similarity
Computational Models for Word meaning
Intuition: why vectors?
Consider sentiment analysis:
◦ With words, a feature is a word identity
◦ Feature 5: 'The previous word was "terrible"'
◦ requires exact same word to be in training and test
◦ With embeddings:
◦ Feature is a word vector
◦ 'The previous word was vector [35,22,17…]'
◦ Now in the test set we might see a similar vector [34,21,14]
◦ We can generalize to similar but unseen words!!!
Computational Models for Word meaning

We'll discuss 2 kinds of embeddings

tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of nearby
words
Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings
Term-Document Matrix

Each document is represented by a vector of word counts:

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0               7            13
good              114              80              62            89
fool               36              58               1             4
wit                20              15               2             3

Figure 6.3: The term-document matrix for four words in four Shakespeare plays.
Each document is represented as a column vector of length four.

Each dimension corresponds to a word: the first dimension, for example, records
the number of times battle occurs. We can compare documents dimension by
dimension, noting for example that the vectors for As You Like It and Twelfth
Night have similar values (1 and 0, respectively) for the first dimension. In
real term-document matrices, the vectors representing each document would have
dimensionality |V|, the vocabulary size.

Visualizing document vectors

We can think of the vector for a document as a point in |V|-dimensional space;
the documents above are points in 4-dimensional space. Since 4-dimensional
spaces are hard to visualize, Figure 6.4 shows a visualization in two
dimensions; we've arbitrarily chosen the dimensions corresponding to the words
battle and fool.

[Figure 6.4: A spatial visualization of the document vectors in the two
dimensions battle and fool: As You Like It [36,1], Twelfth Night [58,0],
Julius Caesar [1,7], Henry V [4,13].]

To review some basic linear algebra, a vector is, at heart, just a list or
array of numbers. So As You Like It is represented as the list [1,114,36,20]
(the first column vector above) and Julius Caesar is represented as the list
[7,62,1,2] (the third column vector). A vector space is a collection of
vectors, characterized by their dimension.

Word Vector?

Idea for word meaning: Words can be vectors too!!!

             As You Like It   Twelfth Night   Julius Caesar   Henry V
battle              1               0               7            13
good              114              80              62            89
fool               36              58               1             4
wit                20              15               2             3

Figure 6.5: The same term-document matrix, now read row-wise: each word is
represented as a row vector of length four.

The four dimensions of the vector for fool, [36,58,1,4], correspond to the four
Shakespeare plays. Word counts in the same four dimensions are used to form the
vectors for the other 3 words: wit, [20,15,2,3]; battle, [1,0,7,13]; and good,
[114,80,62,89].

battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"

For documents, we saw that similar documents had similar vectors, because
similar documents tend to have similar words. This same principle applies to
words: similar words have similar vectors because they tend to occur in similar
documents. The term-document matrix thus lets us represent the meaning of a
word by the documents it tends to occur in.

Word Vector?

More common: word-word matrix (or "term-context matrix")

An alternative to using the term-document matrix to represent words as vectors
of document counts is to use the term-term matrix, also called the word-word
matrix or the term-context matrix, in which the columns are labeled by words
rather than documents. This matrix is thus of dimensionality |V| × |V| and each
cell records the number of times the row (target) word and the column (context)
word co-occur in some context in some training corpus. The context could be the
document, in which case the cell represents the number of times the two words
appear in the same document. It is most common, however, to use smaller
contexts, generally a window around the word, for example of 4 words to the
left and 4 words to the right, in which case the cell represents the number of
times (in some training corpus) the column word occurs in such a ±4 word window
around the row word. For example here is one example each of some words in
their windows:

    is traditionally followed by cherry pie, a traditional dessert
    often mixed, such as strawberry rhubarb pie. Apple pie
    computer peripherals and personal digital assistants. These devices usually
    a computer. This includes information available on the internet

Two words are similar in meaning if their context vectors are similar.

If we then take every occurrence of each word (say strawberry) and count the
context words around it, we get a word-word co-occurrence matrix. Figure 6.6
shows a simplified subset of the word-word co-occurrence matrix for these four
words computed from the Wikipedia corpus (Davies, 2015).

              aardvark  ...  computer   data   result   pie   sugar  ...
cherry            0     ...      2        8       9     442     25   ...
strawberry        0     ...      0        0       1      60     19   ...
digital           0     ...   1670     1683      85       5      4   ...
information       0     ...   3325     3982     378       5     13   ...

Figure 6.6: Co-occurrence vectors for four words in the Wikipedia corpus,
showing six of the dimensions (hand-picked for pedagogical purposes). A real
vector would have vastly more dimensions and thus be much sparser.

Note that the two words cherry and strawberry are more similar to each other
(both pie and sugar tend to occur in their window) than they are to other words
like digital; conversely, digital and information are more similar to each
other than, say, to strawberry.

Word Vector: Cosine for computing similarity

To measure similarity between two target words v and w, we need a metric that
takes two vectors (of the same dimensionality, either both with words as
dimensions, hence of length |V|, or both with documents as dimensions, of
length |D|) and gives a measure of their similarity. By far the most common
similarity metric is the cosine of the angle between the vectors.

The cosine, like most measures for vector similarity used in NLP, is based on
the dot product (also called the inner product) from linear algebra:

    dot product(v, w) = v · w = Σ_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_N w_N

The dot product acts as a similarity metric because it will tend to be high
just when the two vectors have large values in the same dimensions.
Alternatively, vectors that have zeros in different dimensions (orthogonal
vectors) will have a dot product of 0, representing their strong dissimilarity.

This raw dot product, however, has a problem as a similarity metric: it favors
long vectors. The vector length is defined as

    |v| = sqrt( Σ_{i=1}^{N} v_i^2 )

The dot product is higher if a vector is longer, with higher values in each
dimension. More frequent words have longer vectors, since they tend to co-occur
with more words and have higher co-occurrence values with each of them. The raw
dot product thus will be higher for frequent words. But this is a problem; we'd
like a similarity metric that tells us how similar two words are regardless of
their frequency. We therefore normalize the dot product by the lengths of the
two vectors. The cosine similarity metric between two vectors v and w is:

    cosine(v, w) = (v · w) / (|v| |w|)
                 = Σ_{i=1}^{N} v_i w_i / ( sqrt(Σ_{i=1}^{N} v_i^2) · sqrt(Σ_{i=1}^{N} w_i^2) )

For unit vectors, the dot product is the same as the cosine. The cosine value
ranges from 1 for vectors pointing in the same direction, through 0 for vectors
that are orthogonal, to -1 for vectors pointing in opposite directions. But
since raw frequency values are non-negative, the cosine for these vectors
ranges from 0 to 1.

Let's see how the cosine computes which of the words cherry or digital is
closer in meaning to information, just using raw counts from the following
shortened table:

                pie    data   computer
cherry          442       8          2
digital           5    1683       1670
information       5    3982       3325

cos(cherry, information)
  = (442·5 + 8·3982 + 2·3325) / ( sqrt(442² + 8² + 2²) · sqrt(5² + 3982² + 3325²) ) = .017

cos(digital, information)
  = (5·5 + 1683·3982 + 1670·3325) / ( sqrt(5² + 1683² + 1670²) · sqrt(5² + 3982² + 3325²) ) = .996

The model decides that information is way closer to digital than it is to
cherry, a result that seems sensible. Figure 6.7 shows a visualization.

[Figure 6.7: A spatial visualization of the word vectors for digital
[1683,1670] and information [3982,3325] in the dimensions data and computer.]
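To make the two ideas above concrete, here is a small Python sketch (not from the slides or textbook): it counts a toy ±4-word co-occurrence window, the word-word matrix idea, and then computes cosine similarity on the raw counts from the shortened cherry/digital/information table, reproducing the .017 and .996 values to rounding. The corpus snippet and variable names are illustrative assumptions.

```python
import math
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Count how often each context word appears within +/-window of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

def cosine(v, w):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

# Toy text (illustrative only) for the window-counting step.
tokens = "is traditionally followed by cherry pie a traditional dessert".split()
print(dict(cooccurrence_counts(tokens)["cherry"]))

# Raw counts over the contexts [pie, data, computer] from the shortened table.
cherry      = [442,    8,    2]
digital     = [  5, 1683, 1670]
information = [  5, 3982, 3325]

print(f"cos(cherry, information)  = {cosine(cherry, information):.4f}")   # 0.0178
print(f"cos(digital, information) = {cosine(digital, information):.4f}")  # 0.9963
```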
Word Vector: tf-idf

Two common solutions for word weighting:
◦ tf-idf
◦ PMI (Pointwise mutual information): see if words like "good" appear more
often with "great" than we would expect by chance

Term frequency (tf): tf_{t,d} = count(t,d). Instead of using the raw count, we
squash it a bit:

    tf_{t,d} = log10( count(t,d) + 1 )

With log weighting, a term which occurs 0 times in a document has tf = log10(1)
= 0, 10 times tf = log10(11) = 1.04, 100 times tf = log10(101) = 2.004, 1000
times tf = 3.00044, and so on.

Document frequency (df): df_t is the number of documents a term t occurs in.
(Note this is not the collection frequency, the total number of times the word
appears in the whole collection in any document.) Consider in the collection of
Shakespeare's 37 plays the two words Romeo and action. They have identical
collection frequencies (they both occur 113 times in all the plays) but very
different document frequencies, since Romeo only occurs in a single play. If
our goal is to find documents about the romantic tribulations of Romeo, the
word Romeo should be highly weighted, but not action:

              Collection Frequency    Document Frequency
    Romeo             113                      1
    action            113                     31

We emphasize discriminative words like Romeo via the inverse document
frequency (idf):

    idf_t = log10( N / df_t )

where N is the total number of documents in the collection. The idf gives a
higher weight to words that occur only in a few documents. Terms that are
limited to a few documents are useful for discriminating those documents from
the rest of the collection; terms that occur frequently across the entire
collection aren't as helpful. Words like "the" or "it" have very low idf.

Here are some idf values for words in the 37 plays of the Shakespeare corpus,
ranging from extremely informative words which occur in only one play like
Romeo, to those which occur in a few like salad or Falstaff, to those which are
very common like fool, or so frequent as to be completely non-discriminative
since they occur in all 37 plays, like good or sweet:

    Word        df     idf
    Romeo        1    1.57
    salad        2    1.27
    Falstaff     4    0.967
    forest      12    0.489
    battle      21    0.246
    wit         34    0.037
    fool        36    0.012
    good        37    0
    sweet       37    0

TF-IDF: the tf-idf value w_{t,d} for word t in document d thus combines term
frequency with idf:

    w_{t,d} = tf_{t,d} × idf_t

tf-idf weighting is by far the dominant way of weighting co-occurrence matrices
in information retrieval, but it also plays a role in many other aspects of
natural language processing.

Word Vector: Pointwise Mutual Information

Pointwise mutual information:
Do events x and y co-occur more than if they were independent?

    PMI(x, y) = log2 [ P(x,y) / (P(x) P(y)) ]

PMI between two words (Church & Hanks 1989):
Do words word1 and word2 co-occur more than if they were independent?

    PMI(word1, word2) = log2 [ P(word1, word2) / (P(word1) P(word2)) ]

Word Vector: Pointwise Mutual Information

Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to +∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ Unreliable without enormous corpora
◦ Imagine w1 and w2 whose probability is each 10^-6
◦ Hard to be sure p(w1,w2) is significantly different than 10^-12
◦ Plus it's not clear people are good at "unrelatedness"
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:

    PPMI(word1, word2) = max( log2 [ P(word1, word2) / (P(word1) P(word2)) ], 0 )

Computing Positive Pointwise Mutual Information

Computing PPMI on a term-context matrix: let F be a matrix with W rows (words)
and C columns (contexts), where f_ij is the number of times word w_i occurs in
context c_j. Then:

    p_ij = f_ij / Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij
    p_i* = Σ_{j=1}^{C} f_ij / Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij
    p_*j = Σ_{i=1}^{W} f_ij / Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij

    pmi_ij  = log2 [ p_ij / (p_i* p_*j) ]
    ppmi_ij = max( pmi_ij, 0 )

Let's see some PPMI calculations. We'll use the counts below (Fig. 6.10, which
repeats Fig. 6.6 plus the marginals), and let's pretend for ease of calculation
that these are the only words/contexts that matter:

                   computer   data   result   pie   sugar   count(w)
    cherry              2        8       9    442     25       486
    strawberry          0        0       1     60     19        80
    digital          1670     1683      85      5      4      3447
    information      3325     3982     378      5     13      7703
    count(context)   4997     5673     473    512     61     11716

Figure 6.10: Co-occurrence counts for four words in 5 contexts in the Wikipedia
corpus, together with the marginals, pretending for the purpose of this
calculation that no other words/contexts matter.

Thus for example we could compute PPMI(w=information, c=data), assuming we
pretended that Fig. 6.6 encompassed all the relevant word contexts/dimensions,
as follows:

    P(w=information, c=data) = 3982 / 11716 = .3399
    P(w=information)         = 7703 / 11716 = .6575
    P(c=data)                = 5673 / 11716 = .4842

    ppmi(information, data) = log2( .3399 / (.6575 × .4842) ) = .0944
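A small Python sketch of both weighting schemes above, checked against the numbers in the slides: it recomputes the Shakespeare idf values (e.g. idf(Romeo) = log10(37/1) ≈ 1.57) and PPMI(information, data) from the Fig. 6.10 counts. The function and variable names are my own, not from the textbook.

```python
import math

# Inverse document frequency over N = 37 Shakespeare plays.
N = 37
df = {"Romeo": 1, "salad": 2, "Falstaff": 4, "forest": 12,
      "battle": 21, "wit": 34, "fool": 36, "good": 37, "sweet": 37}
idf = {t: math.log10(N / d) for t, d in df.items()}
print(round(idf["Romeo"], 2), round(idf["fool"], 3))  # 1.57 0.012

def tf_idf(count_t_d, term):
    """tf-idf weight for a term occurring count(t,d) times in one document."""
    return math.log10(count_t_d + 1) * idf[term]

print(round(tf_idf(113, "Romeo"), 2))  # 3.23 for a term seen 113 times

# PPMI on the Fig. 6.10 term-context counts.
contexts = ["computer", "data", "result", "pie", "sugar"]
counts = {
    "cherry":      [2,    8,    9,   442, 25],
    "strawberry":  [0,    0,    1,    60, 19],
    "digital":     [1670, 1683, 85,   5,  4],
    "information": [3325, 3982, 378,  5,  13],
}
total = sum(sum(row) for row in counts.values())          # 11716
p_w = {w: sum(row) / total for w, row in counts.items()}  # row marginals
p_c = {c: sum(counts[w][j] for w in counts) / total       # column marginals
       for j, c in enumerate(contexts)}

def ppmi(word, context):
    p_wc = counts[word][contexts.index(context)] / total
    if p_wc == 0:
        return 0.0
    return max(math.log2(p_wc / (p_w[word] * p_c[context])), 0.0)

print(round(ppmi("information", "data"), 3))  # 0.094 (the slides' .0944 uses rounded intermediates)
```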
Word to Vector (word2vec)

Sparse versus dense vectors

tf-idf (or PMI) vectors are


◦ long (length |V|= 20,000 to 50,000)
◦ sparse (most elements are zero)
Alternative: learn vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)
Word to Vector (word2vec)
Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine
learning (fewer weights to tune)
◦ Dense vectors may generalize better than explicit counts
◦ Dense vectors may do better at capturing synonymy:
◦ car and automobile are synonyms; but are distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a
neighbor should be similar, but aren't
◦ In practice, they work better
Word to Vector (word2vec)
Common methods for getting short dense vectors

“Neural Language Model”-inspired models


◦ Word2vec (skipgram, CBOW), GloVe
Singular Value Decomposition (SVD)
◦ A special case of this is called LSA – Latent Semantic Analysis
Alternative to these "static embeddings":
• Contextual Embeddings (ELMo, BERT)
• Compute distinct embeddings for a word in its context
• Separate embeddings for each token of a word
Skip-gram

◦ Is w likely to show up near "apricot"?
◦ We don't actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings
Big idea: self-supervision:
◦ A word c that occurs near apricot in the corpus acts as the gold "correct
answer" for supervised learning
◦ No need for human labels
◦ Bengio et al. (2003); Collobert et al. (2011)
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word c
as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings
Skip-gram

Skip-Gram Classifier (assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4

Goal: train a classifier that is given a candidate (word, context) pair


(apricot, jam)
(apricot, aardvark)

And assigns each pair a probability:
P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)
Skip-gram

To turn this into a probability, we'll use the sigmoid from logistic
regression. The probability that word c is a real context word for target
word w:

    P(+|w, c) = σ(c · w) = 1 / (1 + exp(−c · w))

The sigmoid function returns a number between 0 and 1, but to make it a
probability we need the total probability of the two possible events (c is a
context word, c is not a context word) to sum to 1. We thus estimate the
probability that c is not a real context word for w as:

    P(−|w, c) = 1 − P(+|w, c) = σ(−c · w) = 1 / (1 + exp(c · w))

This gives us the probability for one word, but there are many context words
in the window. Skip-gram makes the simplifying assumption that all context
words are independent, allowing us to just multiply their probabilities.
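A minimal Python sketch of the classifier probability just defined: P(+|w, c) is the sigmoid of the dot product of the two embeddings. The 3-dimensional vectors here are made-up toy values, purely for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def p_positive(w_vec, c_vec):
    """P(+|w, c): probability that c is a real context word for target w."""
    return sigmoid(dot(c_vec, w_vec))

def p_negative(w_vec, c_vec):
    """P(-|w, c) = 1 - P(+|w, c) = sigmoid(-c . w)."""
    return sigmoid(-dot(c_vec, w_vec))

# Toy 3-dimensional embeddings (illustrative values only).
w_apricot  = [0.2, -0.1, 0.4]
c_jam      = [0.3,  0.0, 0.5]
c_aardvark = [-0.4, 0.2, -0.1]

print(p_positive(w_apricot, c_jam))        # ≈ 0.56 (vectors point similar ways)
print(p_positive(w_apricot, c_aardvark))   # ≈ 0.47 (vectors point different ways)
```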
Skip-Gram Training data

Word2vec learns embeddings by starting with an initial set of embedding vectors
and then iteratively shifting the embedding of each word w to be more like the
embeddings of words that occur nearby in texts, and less like the embeddings of
words that don't occur nearby. Let's start by considering a single piece of
training data:

…lemon, a [tablespoon of apricot jam, a] pinch…
           c1         c2 [target] c3   c4

This example has a target word t (apricot), and 4 context words in the L=±2
window, resulting in 4 positive training instances (on the left below):

positive examples +        negative examples -
t        c                 t        c           t        c
apricot  tablespoon        apricot  aardvark    apricot  seven
apricot  of                apricot  my          apricot  forever
apricot  jam               apricot  where       apricot  dear
apricot  a                 apricot  coaxial     apricot  if

For training a binary classifier we also need negative examples. In fact
skip-gram uses more negative examples than positive examples (with the ratio
between them set by a parameter k). So for each of these (t, c) training
instances we'll create k negative samples, each consisting of the target t plus
a 'noise word'.
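A sketch of the self-supervised data generation just described: for each target word it emits the ±2-window context words as positive (t, c) pairs and samples k noise words per positive pair as negatives. The sampling here is uniform over a toy vocabulary for simplicity; word2vec actually samples noise words from a weighted unigram distribution, so treat this as an assumption-laden illustration.

```python
import random

def skipgram_training_pairs(tokens, vocab, window=2, k=2, seed=0):
    """Return (positive, negative) lists of (target, context) training pairs."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            positives.append((target, tokens[j]))
            # k noise words per positive pair (uniform here; word2vec uses
            # a weighted unigram distribution).
            for _ in range(k):
                negatives.append((target, rng.choice(vocab)))
    return positives, negatives

tokens = "lemon a tablespoon of apricot jam a pinch".split()
vocab = ["aardvark", "my", "where", "coaxial", "seven", "forever", "dear", "if"]
pos, neg = skipgram_training_pairs(tokens, vocab)
print([p for p in pos if p[0] == "apricot"])
# [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
```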
Skip-gram Training

We want to maximize the similarity of the target with the actual context words,
and minimize the similarity of the target with the k negative sampled
non-neighbor words. The first term below expresses that we want the classifier
to assign the real context word c_pos a high probability of being a neighbor,
and the second term expresses that we want to assign each of the noise words
c_neg_i a high probability of being a non-neighbor, all multiplied because we
assume independence:

    L_CE = −log [ P(+|w, c_pos) · Π_{i=1}^{k} P(−|w, c_neg_i) ]
         = −[ log P(+|w, c_pos) + Σ_{i=1}^{k} log P(−|w, c_neg_i) ]
         = −[ log P(+|w, c_pos) + Σ_{i=1}^{k} log (1 − P(+|w, c_neg_i)) ]
         = −[ log σ(c_pos · w) + Σ_{i=1}^{k} log σ(−c_neg_i · w) ]

That is, we want to maximize the dot product of the word with the actual
context words, and minimize the dot products of the word with the k negative
sampled non-neighbor words.

The derivatives of the loss function (left as an exercise in the textbook):

    ∂L_CE/∂c_pos = [σ(c_pos · w) − 1] w
    ∂L_CE/∂c_neg = [σ(c_neg · w)] w
    ∂L_CE/∂w     = [σ(c_pos · w) − 1] c_pos + Σ_{i=1}^{k} [σ(c_neg_i · w)] c_neg_i

We minimize this loss function using stochastic gradient descent, with update
equations going from time step t to t+1. Intuitively, one step of learning
moves apricot and jam closer, increasing c_pos · w; a small code sketch of one
such update step is given at the end of this section.
us:
Skip-gram Training
The kinds of neighbors depend on window size
Small windows (C= +/- 2) : nearest words are syntactically
similar words in same taxonomy
◦Hogwarts nearest neighbors are other fictional schools
◦Sunnydale, Evernight, Blandings
Large windows (C= +/- 5) : nearest words are related
words in same semantic field
◦Hogwarts nearest neighbors are Harry Potter world:
◦Dumbledore, half-blood, Malfoy
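
As promised above, here is a minimal sketch of one stochastic-gradient step on the negative-sampling loss L_CE, using the three derivatives given in the training section. The learning rate, dimensionality, and initial vectors are arbitrary toy assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sgd_step(w, c_pos, c_negs, lr=0.1):
    """One SGD update for a (target w, positive context, k noise words) instance.

    Implements:
      dL/dc_pos = (sigmoid(c_pos . w) - 1) * w
      dL/dc_neg = sigmoid(c_neg . w) * w
      dL/dw     = (sigmoid(c_pos . w) - 1) * c_pos + sum_i sigmoid(c_neg_i . w) * c_neg_i
    and moves each vector a small step against its gradient.
    """
    g_pos = sigmoid(dot(c_pos, w)) - 1.0
    g_negs = [sigmoid(dot(c_neg, w)) for c_neg in c_negs]

    # Gradient with respect to the target embedding w.
    grad_w = [g_pos * cp + sum(gn * cn[d] for gn, cn in zip(g_negs, c_negs))
              for d, cp in enumerate(c_pos)]

    new_c_pos = [cp - lr * g_pos * wd for cp, wd in zip(c_pos, w)]
    new_c_negs = [[cn - lr * gn * wd for cn, wd in zip(c_neg, w)]
                  for gn, c_neg in zip(g_negs, c_negs)]
    new_w = [wd - lr * gw for wd, gw in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_negs

# Toy 3-d embeddings for (w = apricot, c_pos = jam, two noise words).
w      = [0.1, -0.2, 0.3]
c_pos  = [0.2,  0.1, 0.0]
c_negs = [[-0.3, 0.4, 0.1], [0.0, -0.1, 0.2]]

w, c_pos, c_negs = sgd_step(w, c_pos, c_negs)
print(dot(c_pos, w))  # c_pos . w increases after the step: "jam" moves closer to "apricot"
```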
