
Word Meaning

Vector
Semantics &
Embeddings
What do words mean?
N-gram or text classification methods we've seen so far
◦ Words are just strings (or indices wi in a vocabulary list)
◦ That's not very satisfactory!
Introductory logic classes:
◦ The meaning of "dog" is DOG; cat is CAT
∀x DOG(x) ⟶ MAMMAL(x)
Old linguistics joke by Barbara Partee in 1967:
◦ Q: What's the meaning of life?
◦ A: LIFE
That seems hardly better!
Desiderata
What should a theory of word meaning do for us?
Let's look at some desiderata
From lexical semantics, the linguistic study of word
meaning
Lemmas and senses
lemma: mouse (N)
senses:
1. any of numerous small rodents...
2. a hand-operated device that controls a cursor...
(Modified from the online thesaurus WordNet)

A sense or "concept" is the meaning component of a word
Lemmas can be polysemous (have multiple senses)
Relations between senses: Synonymy
Synonyms have the same meaning in some or all
contexts.
◦ filbert / hazelnut
◦ couch / sofa
◦ big / large
◦ automobile / car
◦ vomit / throw up
◦ water / H20
Relations between senses: Synonymy
Note that there are probably no examples of perfect
synonymy.
◦ Even if many aspects of meaning are identical
◦ Still may differ based on politeness, slang, register, genre,
etc.
Relation: Synonymy?
water/H20
"H20" in a surfing guide?
big/large
my big sister != my large sister
The Linguistic Principle of Contrast

Difference in form → difference in meaning
Re: "exact" synonyms

Abbé Gabriel Girard (1718):
"[I do not believe that there is a synonymous word in any language]"

Thanks to Mark Aronoff!

Relation: Similarity
Words with similar meanings. Not synonyms, but sharing
some element of meaning

car, bicycle
cow, horse
Ask humans how similar 2 words are

word1 word2 similarity


vanish disappear 9.8
behave obey 7.3
belief impression 5.95
muscle bone 3.65
modest flexible 0.98
hole agreement 0.3

SimLex-999 dataset (Hill et al., 2015)


Relation: Word relatedness
Also called "word association"
Words can be related in any way, perhaps via a semantic
frame or field

◦ coffee, tea: similar


◦ coffee, cup: related, not similar
Semantic field
Words that
◦ cover a particular semantic domain
◦ bear structured relations with each other.

hospitals
surgeon, scalpel, nurse, anaesthetic, hospital
restaurants
waiter, menu, plate, food, chef
houses
door, roof, kitchen, family, bed
Relation: Antonymy
Senses that are opposites with respect to only one
feature of meaning
Otherwise, they are very similar!
dark/light short/long fast/slow rise/fall
hot/cold up/down in/out
More formally: antonyms can
◦ define a binary opposition or be at opposite ends of a scale
◦ long/short, fast/slow
◦ Be reversives:
◦ rise/fall, up/down
Connotation (sentiment)

• Words have affective meanings


• Positive connotations (happy)
• Negative connotations (sad)
• Connotations can be subtle:
• Positive connotation: copy, replica, reproduction
• Negative connotation: fake, knockoff, forgery
• Evaluation (sentiment!)
• Positive evaluation (great, love)
• Negative evaluation (terrible, hate)
Connotation
Osgood et al. (1957)

Words seem to vary along 3 affective dimensions:


◦ valence: the pleasantness of the stimulus
◦ arousal: the intensity of emotion provoked by the stimulus
◦ dominance: the degree of control exerted by the stimulus
Word Score Word Score
Valence love 1.000 toxic 0.008
happy 1.000 nightmare 0.005
Arousal elated 0.960 mellow 0.069
frenzy 0.965 napping 0.046
Dominance powerful 0.991 weak 0.045
leadership 0.983 empty 0.081

Values from NRC VAD Lexicon (Mohammad 2018)


So far
Concepts or word senses
◦ Have a complex many-to-many association with words (homonymy,
multiple senses)
Have relations with each other
◦ Synonymy
◦ Antonymy
◦ Similarity
◦ Relatedness
◦ Connotation
Word Meaning
Vector
Semantics &
Embeddings
Vector Semantics
Vector
Semantics &
Embeddings
Computational models of word meaning

Can we build a theory of how to represent word


meaning, that accounts for at least some of the
desiderata?
We'll introduce vector semantics
The standard model in language processing!
Handles many of our goals!
Ludwig Wittgenstein

PI #43:
"The meaning of a word is its use in the language"
Let's define words by their usages
One way to define "usage":
words are defined by their environments (the words around them)

Zellig Harris (1954):


If A and B have almost identical environments we say that they
are synonyms.
What does recent English borrowing ongchoi mean?
Suppose you see these sentences:
• Ong choi is delicious sautéed with garlic.
• Ong choi is superb over rice
• Ong choi leaves with salty sauces
And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard greens
◦ We could conclude this based on words like "leaves" and "delicious" and "sauteed"
Ongchoi: Ipomoea aquatica "Water Spinach"

空心菜
kangkong
rau muống

Yamaguchi, Wikimedia Commons, public domain


Idea 1: Defining meaning by linguistic distribution

Let's define the meaning of a word by its


distribution in language use, meaning its
neighboring words or grammatical environments.
Idea 2: Meaning as a point in space (Osgood et al. 1957)
3 affective dimensions for a word
◦ valence: pleasantness
◦ arousal: intensity of emotion
◦ dominance: the degree of control exerted
           Word        Score   Word       Score
Valence    love        1.000   toxic      0.008
           happy       1.000   nightmare  0.005
Arousal    elated      0.960   mellow     0.069
           frenzy      0.965   napping    0.046
Dominance  powerful    0.991   weak       0.045
           leadership  0.983   empty      0.081
NRC VAD Lexicon (Mohammad 2018)

Hence the connotation of a word is a vector in 3-space


Idea 1: Defining meaning by linguistic distribution

Idea 2: Meaning as a point in multidimensional space


Defining meaning as a point in space based on distribution
Each word = a vector (not just "good" or "w45")
Similar words are "nearby in semantic space"
We build this space automatically by seeing which words are nearby in text

[Figure: a 2-dimensional projection of a semantic space; negative words (bad, worse, worst, dislike, "not good", "incredibly bad") cluster together, while positive words (good, nice, amazing, fantastic, terrific, wonderful, "very good", "incredibly good") cluster in a separate region]
We define meaning of a word as a vector
Called an "embedding" because it's embedded into a
space (see textbook)
The standard way to represent meaning in NLP
Every modern NLP algorithm uses embeddings as
the representation of word meaning
Fine-grained model of meaning for similarity
Intuition: why vectors?
Consider sentiment analysis:
◦ With words, a feature is a word identity
◦ Feature 5: 'The previous word was "terrible"'
◦ requires exact same word to be in training and test
◦ With embeddings:
◦ Feature is a word vector
◦ 'The previous word was vector [35,22,17…]'
◦ Now in the test set we might see a similar vector [34,21,14]
◦ We can generalize to similar but unseen words!!!
We'll discuss 2 kinds of embeddings
tf-idf
◦ Information Retrieval workhorse!
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by (a simple function of) the counts of nearby
words
Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to predict whether a
word is likely to appear nearby
◦ Later we'll discuss extensions called contextual embeddings
From now on:
Computing with meaning representations instead of string representations

"Nets are for fish; once you get the fish, you can forget the net.
Words are for meaning; once you get the meaning, you can forget the words."
Zhuangzi, Chapter 26

Vector Semantics
Vector
Semantics &
Embeddings
Words and Vectors
Vector
Semantics &
Embeddings
Term-document matrix

Each document is represented by a vector of words. (Only four dimensions are shown so they fit on the page; in real term-document matrices, the vectors representing each document would have dimensionality |V|, the vocabulary size.) The ordering of the numbers in a vector space indicates different meaningful dimensions on which documents vary. Thus the first dimension for both these vectors corresponds to the number of times the word battle occurs, and we can compare each dimension, noting for example that the vectors for As You Like It and Twelfth Night have similar values (1 and 0, respectively) for the first dimension.

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7            13
good           114               80              62            89
fool            36               58               1             4
wit             20               15               2             3

Figure 6.3: The term-document matrix for four words in four Shakespeare plays. Each cell contains the number of times the (row) word occurs in the (column) document; each document is represented as a column vector of length four.

To review some basic linear algebra, a vector is, at heart, just a list or array of numbers. So As You Like It is represented as the count vector [1,114,36,20] (the first column in Fig. 6.3), and Julius Caesar is represented as the list [7,62,1,2] (the third column vector). We can think of the vector for a document as a point in |V|-dimensional space; the documents in Fig. 6.3 are thus points in 4-dimensional space. Since 4-dimensional spaces are hard to visualize, Fig. 6.4 shows a visualization in two dimensions, arbitrarily choosing the dimensions corresponding to the words battle and fool.
Visualizing document vectors
[Figure 6.4: A spatial visualization of the document vectors for the four Shakespeare plays on the dimensions fool (x-axis) and battle (y-axis): Henry V [4,13], Julius Caesar [1,7], As You Like It [36,1], Twelfth Night [58,0]]
Vectors are the basis of information retrieval

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7            13
good           114               80              62            89
fool            36               58               1             4
wit             20               15               2             3
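As a concrete illustration (my own sketch, not part of the original slides), here is this term-document matrix as a NumPy array; the variable names are mine, and the point is just that a document is a column of the matrix and a word is a row:

```python
import numpy as np

# Term-document counts from Fig. 6.3 (rows = words, columns = plays).
words = ["battle", "good", "fool", "wit"]
plays = ["As You Like It", "Twelfth Night", "Julius Caesar", "Henry V"]
M = np.array([
    [  1,   0,  7, 13],   # battle
    [114,  80, 62, 89],   # good
    [ 36,  58,  1,  4],   # fool
    [ 20,  15,  2,  3],   # wit
])

# A document is a column vector of word counts ...
print(plays[0], M[:, 0])    # As You Like It -> [  1 114  36  20]
# ... and a word is a row vector of counts over documents (see the next slides).
print(words[2], M[2, :])    # fool -> [36 58  1  4]
```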
Vectors are similar for the two comedies
But comedies are different than the other two
Comedies have more fools and wit and fewer battles.
Idea for word meaning: Words can be vectors too!!!

A word can also be represented as a row vector, with the documents as dimensions, as shown in Fig. 6.5. The four dimensions of the vector for fool, [36,58,1,4], correspond to the four Shakespeare plays. Word counts in the same four dimensions form the vectors for the other 3 words: wit [20,15,2,3]; battle [1,0,7,13]; and good [114,80,62,89].

          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7            13
good           114               80              62            89
fool            36               58               1             4
wit             20               15               2             3

Figure 6.5: The same term-document matrix, with each word represented as a row vector of length four.

battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"

For documents, we saw that similar documents had similar vectors, because similar documents tend to have similar words. The same principle applies to words: similar words have similar vectors because they tend to occur in similar documents. The term-document matrix thus lets us represent the meaning of a word by the documents it tends to occur in.
More common: word-word matrix (or "term-context matrix")

Instead of documents as dimensions, the context could be a window around the word, for example 4 words to the left and 4 words to the right, in which case each cell represents the number of times (in some training corpus) the column word occurs in such a ±4 word window around the row word. For example, here is one example each of some words in their windows:

    is traditionally followed by cherry pie, a traditional dessert
    often mixed, such as strawberry rhubarb pie. Apple pie
    computer peripherals and personal digital assistants. These devices usually
    a computer. This includes information available on the internet

Two words are similar in meaning if their context vectors are similar.

If we then take every occurrence of each word (say strawberry) and count the context words around it, we get a word-word co-occurrence matrix. Fig. 6.6 shows a simplified subset of the word-word co-occurrence matrix for these four words computed from the Wikipedia corpus (Davies, 2015).

             aardvark ... computer   data  result   pie  sugar ...
cherry           0    ...       2       8      9    442     25 ...
strawberry       0    ...       0       0      1     60     19 ...
digital          0    ...    1670    1683     85      5      4 ...
information      0    ...    3325    3982    378      5     13 ...

Note that the two words cherry and strawberry are more similar to each other (both pie and sugar tend to occur in their windows) than they are to the other two words.

[Figure: a 2-D projection showing the vectors for digital [1683,1670] and information [3982,3325] on the dimensions data (x-axis) and computer (y-axis)]
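The following is a small sketch (my own, not the slides' code) of how such window-based co-occurrence counts could be collected; the function name and the toy whitespace-tokenized corpus are just for illustration, whereas the matrix above was computed from Wikipedia:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=4):
    """For each (row word, column word) pair, count how often the column word
    appears within +/- `window` tokens of the row word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

# Toy corpus (made-up sentences, just to show the data structure).
corpus = [
    "a tablespoon of apricot jam a pinch of sugar",
    "a tablespoon of strawberry jam on toast",
]
counts = cooccurrence_counts(corpus, window=4)
print(counts["apricot"]["jam"])        # 1
print(counts["jam"]["tablespoon"])     # 2
```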
Words and Vectors
Vector
Semantics &
Embeddings
Cosine for computing word similarity
Vector
Semantics &
Embeddings
Computing word similarity: Dot product and cosine

To measure similarity between two word vectors (each of length |V|, or of length |D| with documents as dimensions), by far the most common similarity metric is the cosine of the angle between the vectors. The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product; the dot product between two vectors is a scalar:

dot-product(v, w) = v · w = Σ_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + ... + v_N w_N     (6.7)

The dot product tends to be high when the two vectors have large values in the same dimensions; vectors that have zeros in different dimensions (orthogonal vectors) have a dot product of 0, representing strong dissimilarity.

Dot product can thus be a useful similarity metric between vectors.
Problem with raw dot-product

The raw dot product has a problem as a similarity metric: it favors long vectors. The dot product is higher if a vector is longer (has higher values in many dimensions).

Vector length: |v| = sqrt( Σ_{i=1}^{N} v_i² )     (6.8)

Frequent words (of, the, you) have long vectors, since they occur many times with other words and have higher co-occurrence values with each of them. So the raw dot product overly favors frequent words; we'd like a similarity metric that tells us how similar two words are regardless of their frequency.
Alternative: cosine for computing word similarity

We modify the dot product to normalize for vector length by dividing the dot product by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, based on the definition of the dot product between two vectors a and b:

a · b = |a||b| cos θ
(a · b) / (|a||b|) = cos θ     (6.9)

The cosine similarity between two vectors v and w is thus:

cosine(v, w) = (v · w) / (|v||w|) = ( Σ_{i=1}^{N} v_i w_i ) / ( sqrt(Σ_{i=1}^{N} v_i²) · sqrt(Σ_{i=1}^{N} w_i²) )     (6.10)

For some applications we pre-normalize each vector by dividing it by its length, creating a unit vector of length 1; for unit vectors, the dot product is the same as the cosine.

The cosine value ranges from 1 for vectors pointing in the same direction, through 0 for vectors that are orthogonal, to -1 for vectors pointing in opposite directions. But raw frequency values are non-negative, so the cosine for these vectors ranges from 0 to 1.
Cosine as a similarity metric

-1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal

But since raw frequency values are non-negative, the cosine for term-term matrix vectors ranges from 0 to 1.
Cosine examples

Let's see how the cosine computes which of the words cherry or digital is closer in meaning to information, just using raw counts from the following shortened table:

              pie   data  computer
cherry        442      8         2
digital         5   1683      1670
information     5   3982      3325

cosine(v, w) = (v · w) / (|v||w|) = ( Σ_{i=1}^{N} v_i w_i ) / ( sqrt(Σ_i v_i²) · sqrt(Σ_i w_i²) )

cos(cherry, information) = (442·5 + 8·3982 + 2·3325) / ( sqrt(442² + 8² + 2²) · sqrt(5² + 3982² + 3325²) ) = .017

cos(digital, information) = (5·5 + 1683·3982 + 1670·3325) / ( sqrt(5² + 1683² + 1670²) · sqrt(5² + 3982² + 3325²) ) = .996

The model decides that information is way closer to digital than it is to cherry, a result that seems sensible. Fig. 6.7 shows a visualization.
Visualizing cosines (well, angles)

[Figure 6.7: A (rough) graphical demonstration of cosine similarity, showing the vectors for cherry, digital, and information plotted on Dimension 1 ('pie') and Dimension 2 ('computer')]
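A quick sketch checking the arithmetic above with NumPy, using the same raw counts over the dimensions [pie, data, computer]:

```python
import numpy as np

def cosine(v, w):
    # cosine(v, w) = (v . w) / (|v| |w|)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

print(cosine(cherry, information))   # ~0.0178 (the .017 above)
print(cosine(digital, information))  # ~0.996
```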
Cosine for computing word
Vector similarity
Semantics &
Embeddings
TF-IDF
Vector
Semantics &
Embeddings
But raw frequency is a bad representation

• The co-occurrence matrices we have seen represent each


cell by word frequencies.
• Frequency is clearly useful; if sugar appears a lot near
apricot, that's useful information.
• But overly frequent words like the, it, or they are not very
informative about the context
• It's a paradox! How can we balance these two conflicting
constraints?
Two common solutions for word weighting

tf-idf: the tf-idf value for word t in document d:
w_{t,d} = tf_{t,d} × idf_t
Words like "the" or "it" have very low idf

PMI (Pointwise mutual information):
◦ PMI(w₁, w₂) = log [ p(w₁, w₂) / (p(w₁) p(w₂)) ]
See if words like "good" appear more often with "great" than we would expect by chance
Term frequency (tf)

tf_{t,d} = count(t,d)

Instead of using the raw count, we squash it a bit:

tf_{t,d} = log10(count(t,d) + 1)
Document frequency (df)

df_t is the number of documents t occurs in.
(Note this is not collection frequency: the total count of the term across all documents.)

Consider in the collection of Shakespeare's 37 plays the two words Romeo and action. They have identical collection frequencies (both occur 113 times in all the plays) but very different document frequencies, since Romeo occurs in only a single play. If our goal is to find documents about the romantic tribulations of Romeo, the word Romeo should be highly weighted, but not action. "Romeo" is very distinctive for one Shakespeare play:

          Collection Frequency   Document Frequency
Romeo            113                      1
action           113                     31
Inverse document frequency (idf)

We emphasize discriminative words like Romeo via the inverse document frequency or idf term weight (Sparck Jones, 1972). The idf is defined using the fraction N/df_t, where N is the total number of documents in the collection and df_t is the document frequency, squashed with a log:

idf_t = log10( N / df_t )     (6.13)

Some idf values for words in the Shakespeare corpus, ranging from words that occur in only one play like Romeo, to those that occur in a few like salad or Falstaff, to those that are very common like fool, to those that are so common as to be completely non-discriminative since they occur in all 37 plays, like good or sweet:

Word       df    idf
Romeo       1    1.57
salad       2    1.27
Falstaff    4    0.967
forest     12    0.489
battle     21    0.246
wit        34    0.037
fool       36    0.012
good       37    0
sweet      37    0
What is a document?

Could be a play or a Wikipedia article


But for the purposes of tf-idf, documents can be
anything; we often call each paragraph a document!
Final tf-idf weighted value for a word

The tf-idf weighted value w_{t,d} for word t in document d combines term frequency tf_{t,d} (defined either by Eq. 6.11 or by Eq. 6.12) with idf_t from Eq. 6.13:

w_{t,d} = tf_{t,d} × idf_t     (6.14)

Raw counts:
          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle           1                0               7            13
good           114               80              62            89
fool            36               58               1             4
wit             20               15               2             3

tf-idf:
          As You Like It   Twelfth Night   Julius Caesar   Henry V
battle        0.074            0               0.22          0.28
good          0                0               0             0
fool          0.019            0.021           0.0036        0.0083
wit           0.049            0.044           0.018         0.022

Figure 6.9: A tf-idf weighted term-document matrix for four words in four Shakespeare plays, using the tf equation Eq. 6.12. Note that the tf-idf values for the dimension corresponding to the word good have all become 0; since this word appears in every document, the tf-idf algorithm leads it to be ignored. Similarly, the word fool, which appears in 36 out of the 37 plays, has a much lower weight.
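A short sketch (my own, assuming tf = log10(count+1) and idf = log10(N/df_t) as defined above, with N = 37 plays and the df values from the idf table) that reproduces the tf-idf values in Fig. 6.9:

```python
import numpy as np

# Raw counts for 4 words in the 4 plays shown above (a subset of the 37 plays).
counts = np.array([
    [  1,   0,  7, 13],   # battle
    [114,  80, 62, 89],   # good
    [ 36,  58,  1,  4],   # fool
    [ 20,  15,  2,  3],   # wit
], dtype=float)

N = 37                                        # total number of Shakespeare plays
df = np.array([21, 37, 36, 34], dtype=float)  # document frequencies from the idf table

tf = np.log10(counts + 1)     # tf_{t,d} = log10(count + 1)
idf = np.log10(N / df)        # idf_t = log10(N / df_t)
tfidf = tf * idf[:, None]     # w_{t,d} = tf_{t,d} * idf_t

print(np.round(tfidf, 3))
# battle row ~ [0.074, 0, 0.222, 0.282], matching Fig. 6.9's 0.074, 0, 0.22, 0.28;
# the 'good' row is all zeros because idf(good) = log10(37/37) = 0.
```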
TF-IDF
Vector
Semantics &
Embeddings
PPMI
Vector
Semantics &
Embeddings
Pointwise Mutual Information

Pointwise mutual information: Do events x and y co-occur more than if they were independent?

PMI(X, Y) = log2 [ P(x,y) / (P(x) P(y)) ]

PMI between two words (Church & Hanks 1989): Do words x and y co-occur more than if they were independent?

PMI(word₁, word₂) = log2 [ P(word₁, word₂) / (P(word₁) P(word₂)) ]
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to +∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ Unreliable without enormous corpora
◦ Imagine w1 and w2 whose probability is each 10⁻⁶
◦ Hard to be sure p(w1,w2) is significantly different than 10⁻¹²
◦ Plus it's not clear people are good at "unrelatedness"
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:

PPMI(word₁, word₂) = max( log2 [ P(word₁, word₂) / (P(word₁) P(word₂)) ], 0 )
Computing PPMI on a term-context matrix

Matrix F with W rows (words) and C columns (contexts); f_ij is the number of times word w_i occurs in context c_j.

p_ij = f_ij / ( Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij )
p_{i*} = ( Σ_{j=1}^{C} f_ij ) / ( Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij )      (row/word marginal)
p_{*j} = ( Σ_{i=1}^{W} f_ij ) / ( Σ_{i=1}^{W} Σ_{j=1}^{C} f_ij )      (column/context marginal)

pmi_ij = log2( p_ij / (p_{i*} p_{*j}) )
ppmi_ij = max(pmi_ij, 0)

Let's see some PPMI calculations. We'll use Fig. 6.10, which repeats Fig. 6.6 plus all the count marginals, and let's pretend for ease of calculation that these are the only words/contexts that matter.

                computer   data  result   pie  sugar  count(w)
cherry                2       8       9   442     25       486
strawberry            0       0       1    60     19        80
digital            1670    1683      85     5      4      3447
information        3325    3982     378     5     13      7703
count(context)     4997    5673     473   512     61     11716

Figure 6.10: Co-occurrence counts for four words in 5 contexts in the Wikipedia corpus, together with the marginals, pretending for the purpose of this calculation that no other words/contexts matter.

Thus for example we could compute PPMI(w=information, c=data) as follows:

P(w=information, c=data) = 3982/11716 = .3399
P(w=information) = 7703/11716 = .6575
P(c=data) = 5673/11716 = .4842
ppmi(information, data) = log2( .3399 / (.6575 × .4842) ) = .0944

Fig. 6.11 shows the joint probabilities computed from the counts in Fig. 6.10, and Fig. 6.12 shows the PPMI values. Not surprisingly, cherry and strawberry are highly associated with both pie and sugar, and data is mildly associated with information.

p(w,context)    computer    data  result     pie   sugar     p(w)
cherry            0.0002  0.0007  0.0008  0.0377  0.0021   0.0415
strawberry        0.0000  0.0000  0.0001  0.0051  0.0016   0.0068
digital           0.1425  0.1436  0.0073  0.0004  0.0003   0.2942
information       0.2838  0.3399  0.0323  0.0004  0.0011   0.6575
p(context)        0.4265  0.4842  0.0404  0.0437  0.0052

Figure 6.11: Replacing the counts in Fig. 6.6 with joint probabilities, showing the marginals around the outside.
Resulting PPMI matrix (negatives replaced by 0)

             computer   data  result    pie  sugar
cherry           0        0      0     4.38   3.30
strawberry       0        0      0     4.10   5.51
digital          0.18     0.01   0     0      0
information      0.02     0.09   0.28  0      0

Figure 6.12: The PPMI matrix showing the association between words and context words.
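A sketch (my own) computing the PPMI matrix directly from the Fig. 6.10 counts; it reproduces the values in Fig. 6.12:

```python
import numpy as np

# Co-occurrence counts from Fig. 6.10 (rows = words, columns = contexts).
words = ["cherry", "strawberry", "digital", "information"]
contexts = ["computer", "data", "result", "pie", "sugar"]
F = np.array([
    [   2,    8,   9, 442, 25],
    [   0,    0,   1,  60, 19],
    [1670, 1683,  85,   5,  4],
    [3325, 3982, 378,   5, 13],
], dtype=float)

total = F.sum()                           # 11716
p_ij = F / total                          # joint probabilities (Fig. 6.11)
p_i = p_ij.sum(axis=1, keepdims=True)     # word marginals p(w)
p_j = p_ij.sum(axis=0, keepdims=True)     # context marginals p(c)

with np.errstate(divide="ignore"):        # log2(0) -> -inf, clipped to 0 below
    pmi = np.log2(p_ij / (p_i * p_j))
ppmi = np.maximum(pmi, 0)

print(np.round(ppmi, 2))
# e.g. PPMI(cherry, pie) ~ 4.38 and PPMI(information, data) ~ 0.09, matching Fig. 6.12
```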
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities
◦ Use add-one smoothing (which has a similar effect)

Weighting PMI: Giving rare context words slightly higher probability

PMI has the problem of being biased toward infrequent events; very rare words tend to have very high PMI values. One way to reduce this bias toward low-frequency events is to slightly change the computation for P(c), using a different function P_α(c) that raises context probabilities to the power of α (Levy et al., 2015). Raise the context probabilities to α = 0.75:

PPMI_α(w, c) = max( log2 [ P(w,c) / (P(w) P_α(c)) ], 0 )     (19.8)

P_α(c) = count(c)^α / Σ_c count(c)^α     (19.9)

Levy et al. (2015) found that a setting of α = 0.75 improved performance of embeddings on a wide range of tasks (drawing on a similar weighting used for skip-grams (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014)). This works because raising the probability to α = 0.75 increases the probability assigned to rare contexts, and hence lowers their PMI: P_α(c) > P(c) when c is rare.

Consider two events, P(a) = .99 and P(b) = .01:
P_α(a) = .99^.75 / (.99^.75 + .01^.75) = .97
P_α(b) = .01^.75 / (.99^.75 + .01^.75) = .03

Another possible solution is Laplace smoothing: before computing PMI, a small constant is added to each of the counts.
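A tiny numeric check (my own sketch) of this α = 0.75 reweighting for the two-event example above; probabilities can be plugged in directly in place of counts, since the normalizer cancels:

```python
# Alpha-weighted context probabilities: P_alpha(c) = count(c)^alpha / sum_c count(c)^alpha.
alpha = 0.75
p = {"a": 0.99, "b": 0.01}
z = sum(v ** alpha for v in p.values())
p_alpha = {c: (v ** alpha) / z for c, v in p.items()}
print({c: round(v, 2) for c, v in p_alpha.items()})   # {'a': 0.97, 'b': 0.03}
```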
Word2vec
Vector
Semantics &
Embeddings
Sparse versus dense vectors

tf-idf (or PMI) vectors are


◦ long (length |V|= 20,000 to 50,000)
◦ sparse (most elements are zero)
Alternative: learn vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)
Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine
learning (fewer weights to tune)
◦ Dense vectors may generalize better than explicit counts
◦ Dense vectors may do better at capturing synonymy:
◦ car and automobile are synonyms; but are distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a
neighbor should be similar, but aren't
◦ In practice, they work better
Common methods for getting short dense vectors

“Neural Language Model”-inspired models


◦ Word2vec (skipgram, CBOW), GloVe
Singular Value Decomposition (SVD)
◦ A special case of this is called LSA – Latent Semantic Analysis
Alternative to these "static embeddings":
• Contextual Embeddings (ELMo, BERT)
• Compute distinct embeddings for a word in its context
• Separate embeddings for each token of a word
Simple static embeddings you can download!

Word2vec (Mikolov et al)


https://code.google.com/archive/p/word2vec/

GloVe (Pennington, Socher, Manning)


http://nlp.stanford.edu/projects/glove/
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec provides various options. We'll do:
skip-gram with negative sampling (SGNS)
Word2vec
Instead of counting how often each word w occurs near "apricot"
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?
We don’t actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings
Big idea: self-supervision:
◦ A word c that occurs near apricot in the corpus acts as the gold "correct
answer" for supervised learning
◦ No need for human labels
◦ Bengio et al. (2003); Collobert et al. (2011)
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word c
as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings
Skip-Gram Training Data

Assume a +/- 2 word window, given training sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
Skip-Gram Classifier
(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4

Goal: train a classifier that is given a candidate (word, context) pair


(apricot, jam)
(apricot, aardvark)

And assigns each pair a probability:
P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)
Similarity is computed from dot product
Remember: two vectors are similar if they have a high
dot product
◦ Cosine is just a normalized dot product
So:
◦ Similarity(w,c) ∝ w · c
We’ll need to normalize to get a probability
◦ (cosine isn't a probability either)
Turning dot products into probabilities

Sim(w, c) ≈ w · c

To turn this into a probability we'll use the sigmoid from logistic regression. We model the probability that word c is a real context word for target word w as:

P(+|w, c) = σ(c · w) = 1 / (1 + exp(−c · w))

The sigmoid returns a number between 0 and 1, but to make it a probability we'll also need the total probability of the two possible events (c is a context word, and c isn't a context word) to sum to 1. We thus estimate the probability that c is not a real context word for w as:

P(−|w, c) = 1 − P(+|w, c) = σ(−c · w) = 1 / (1 + exp(c · w))
P( |w, c) = 1 P(+|w, c)
How Skip-Gram Classifier computes
1 P(+|w,
We model the probability that word c is a real context word for target word w as:
c)
= s ( c · w) = (6.29)
11 + exp (c · w)
P(+|w, c) = s (c · w) = (6.28)
1 + exp ( c · w)
uation 6.28 gives us the probability for one word, but there are many context
This is function
The sigmoid for onereturns
context word,between
a number but we have
0 and lots
1, but of context
to make words.
it a probability
rds inwe’ll
theWe'll
window.
also need the Skip-gram
total makes
probability of the the
two simplifying
assume independence and just multiply them:
possible events assumption
(c is a context that
word, all context
rds are
andindependent, allowing
c isn’t a context word) to sumus toWe
to 1. just multiply
thus their
estimate the probabilities:
probability that word c
is not a real context word for w as:
Y L
P( |w, c) = 1 P(+|w, c)
P(+|w, c1:L ) = s 1(ci · w) (6.30)
= s ( c · w) = (6.29)
i=1
1 + exp (c · w)
X L
Equation 6.28 gives us the probability for one word, but there are many context
words in the window.log P(+|w,makes
Skip-gram c1:L )the=simplifying s (ci · w)that all context
logassumption (6.31)
words are independent, allowing us to just multiply
i=1their probabilities:
L
Y
summary, skip-gram trains a probabilistic classifier that, given a test target word
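A minimal sketch of this probability computation with made-up toy embeddings (function names and dimensionality are mine; the real w and c vectors are learned, as described later):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(w, c):
    # P(+|w, c) = sigma(c . w)
    return sigmoid(np.dot(c, w))

def log_p_window(w, context_vectors):
    # log P(+|w, c_{1:L}) = sum_i log sigma(c_i . w)  (independence assumption)
    return sum(np.log(p_positive(w, c)) for c in context_vectors)

# Toy 3-dimensional embeddings, randomly generated purely for illustration.
rng = np.random.default_rng(0)
w = rng.normal(size=3)                             # target word embedding
context = [rng.normal(size=3) for _ in range(4)]   # L = 4 context word embeddings

print(p_positive(w, context[0]))   # probability for one context word
print(log_p_window(w, context))    # log-probability for the whole window
```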
Skip-gram classifier: summary
A probabilistic classifier, given
• a test target word w
• its context window of L words c1:L
Estimates probability that w occurs in this window based
on similarity of w (embeddings) to c1:L (embeddings).

To compute this, we just need embeddings for all the


words.
These embeddings we'll need: a set for w, a set for c

[Figure: two embedding matrices, each with d-dimensional rows. W holds the target-word embeddings for the |V| vocabulary words (aardvark ... zebra, rows 1 to |V|); C holds the context & noise word embeddings (rows |V|+1 to 2V)]
Word2vec
Vector
Semantics &
Embeddings
Word2vec: Learning the
Vector embeddings
Semantics &
Embeddings
Skip-Gram Training data

Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby. Let's start by considering a single piece of training data:

…lemon, a [tablespoon of apricot jam, a] pinch…
              c1        c2   [target] c3  c4

This example has a target word t (apricot) and 4 context words in the L = ±2 window, resulting in 4 positive training instances (on the left below). For each positive example we'll grab k negative examples, sampling by frequency:

positive examples +          negative examples -
t        c                   t        c           t        c
apricot  tablespoon          apricot  aardvark    apricot  seven
apricot  of                  apricot  my          apricot  forever
apricot  jam                 apricot  where       apricot  dear
apricot  a                   apricot  coaxial     apricot  if
Word2vec: how to learn vectors

Given the set of positive and negative training instances, and an initial set of embedding vectors
The goal of learning is to adjust those word vectors such that we:
◦ Maximize the similarity of the target word, context word pairs (w, c_pos) drawn from the positive data
◦ Minimize the similarity of the (w, c_neg) pairs drawn from the negative data.
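As an aside, here is a sketch (my own, not the original word2vec code) of how positive/negative training pairs like the ones in the table above could be generated: positives from a ±2 window, k negatives per positive sampled by a weighted unigram frequency. The toy sentence and extra noise vocabulary are just for illustration; a real implementation would also avoid sampling actual window words as negatives.

```python
import random

def training_pairs(tokens, vocab_counts, window=2, k=2, alpha=0.75):
    """Yield (target, context, label) triples: label 1 for words in the window,
    label 0 for k noise words sampled by weighted unigram frequency."""
    noise_words = list(vocab_counts)
    weights = [vocab_counts[w] ** alpha for w in noise_words]  # P_alpha-style sampling
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield (target, tokens[j], 1)                        # positive example
            for noise in random.choices(noise_words, weights=weights, k=k):
                yield (target, noise, 0)                        # negative example

tokens = "lemon a tablespoon of apricot jam a pinch".split()
vocab_counts = {w: tokens.count(w) for w in set(tokens)}
vocab_counts.update({"aardvark": 1, "seven": 1})                # extra noise candidates
random.seed(0)
for pair in list(training_pairs(tokens, vocab_counts))[:6]:
    print(pair)
```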
Loss function for one w with c_pos, c_neg1 ... c_negk

We want to maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words.

If we consider one word/context pair (w, c_pos) with its k noise words c_neg1 ... c_negk, we can express these two goals as the following loss function L to be minimized (hence the minus sign); the first term expresses that we want the classifier to assign the real context word c_pos a high probability of being a neighbor, and the second term expresses that we want to assign each of the noise words c_negi a high probability of being a non-neighbor, all multiplied because we assume independence:

L_CE = −log [ P(+|w, c_pos) · Π_{i=1}^{k} P(−|w, c_negi) ]
     = −[ log P(+|w, c_pos) + Σ_{i=1}^{k} log P(−|w, c_negi) ]
     = −[ log P(+|w, c_pos) + Σ_{i=1}^{k} log (1 − P(+|w, c_negi)) ]
     = −[ log σ(c_pos · w) + Σ_{i=1}^{k} log σ(−c_negi · w) ]     (6.34)
Learning the classifier
How to learn?
◦ Stochastic gradient descent!

We’ll adjust the word weights to


◦ make the positive pairs more likely
◦ and the negative pairs less likely,
◦ over the entire training set.
Intuition of one step of gradient descent

[Figure: for the training text "…apricot jam…" with k = 2 noise words (matrix, Tolstoy), one gradient step moves apricot and jam closer (increasing c_pos · w), moves apricot and matrix apart (decreasing c_neg1 · w), and moves apricot and Tolstoy apart (decreasing c_neg2 · w)]
Reminder: gradient descent

• At each step:
• Direction: we move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient, (d/dw) L(f(x;w), y), weighted by a learning rate η
• Higher learning rate means move w faster

w^{t+1} = w^t − η (d/dw) L(f(x;w), y)
The derivatives of the loss function

∂L_CE / ∂c_pos = [σ(c_pos · w) − 1] w
∂L_CE / ∂c_negi = [σ(c_negi · w)] w
∂L_CE / ∂w = [σ(c_pos · w) − 1] c_pos + Σ_{i=1}^{k} [σ(c_negi · w)] c_negi     (6.35)

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words. We minimize this loss function using stochastic gradient descent; the figure above shows the intuition of one step of learning.
Update equation in SGD

Start with randomly initialized C and W matrices, then incrementally do updates. The update equations going from time step t to t+1 in stochastic gradient descent are:

c_pos^{t+1} = c_pos^t − η [σ(c_pos^t · w^t) − 1] w^t
c_neg^{t+1} = c_neg^t − η [σ(c_neg^t · w^t)] w^t
w^{t+1} = w^t − η [ (σ(c_pos^t · w^t) − 1) c_pos^t + Σ_{i=1}^{k} σ(c_negi^t · w^t) c_negi^t ]

Just as in logistic regression, the learning algorithm starts with randomly initialized W and C matrices, and then walks through the training corpus using gradient descent to move W and C so as to maximize the objective in Eq. 6.34.
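A sketch (my own) of one such update step for a single (w, c_pos, c_neg1..k) training instance, with toy randomly initialized embeddings and a made-up learning rate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, c_negs, eta=0.1):
    """One SGD step for skip-gram with negative sampling (updates arrays in place)."""
    # Gradients from Eq. 6.35, computed at the current parameter values.
    g_pos = sigmoid(np.dot(c_pos, w)) - 1.0            # scalar factor for c_pos
    g_negs = [sigmoid(np.dot(c, w)) for c in c_negs]   # scalar factors for each c_neg
    grad_w = g_pos * c_pos + sum(g * c for g, c in zip(g_negs, c_negs))

    # Update equations: c_pos, each c_neg, then w.
    c_pos -= eta * g_pos * w
    for g, c in zip(g_negs, c_negs):
        c -= eta * g * w
    w -= eta * grad_w

rng = np.random.default_rng(1)
d = 4                                     # toy embedding dimensionality
w = rng.normal(scale=0.1, size=d)         # target word embedding (e.g. apricot)
c_pos = rng.normal(scale=0.1, size=d)     # positive context embedding (e.g. jam)
c_negs = [rng.normal(scale=0.1, size=d) for _ in range(2)]   # k = 2 noise words

before = sigmoid(np.dot(c_pos, w))
sgns_step(w, c_pos, c_negs)
after = sigmoid(np.dot(c_pos, w))
print(before, "->", after)                # P(+|w, c_pos) should typically increase
```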
Two sets of embeddings
SGNS learns two sets of embeddings
Target embeddings matrix W
Context embedding matrix C
It's common to just add them together,
representing word i as the vector wi + ci
Summary: How to learn word2vec (skip-gram)
embeddings
Start with V random d-dimensional vectors as initial
embeddings
Train a classifier based on embedding similarity
◦Take a corpus and take pairs of words that co-occur as positive
examples
◦Take pairs of words that don't co-occur as negative examples
◦Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦Throw away the classifier code and keep the embeddings.
Word2vec: Learning the
Vector embeddings
Semantics &
Embeddings
Properties of Embeddings
Vector
Semantics &
Embeddings
The kinds of neighbors depend on window size
Small windows (C= +/- 2) : nearest words are syntactically
similar words in same taxonomy
◦Hogwarts nearest neighbors are other fictional schools
◦Sunnydale, Evernight, Blandings
Large windows (C= +/- 5) : nearest words are related
words in same semantic field
◦Hogwarts nearest neighbors are Harry Potter world:
◦Dumbledore, half-blood, Malfoy
Analogical relations

The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973)

In an important early vector space model of cognition, Rumelhart and Abrahamson (1973) proposed the parallelogram model for solving simple analogy problems of the form a is to b as a* is to what?. In such problems, a system is given a problem like apple:tree::grape:?, i.e., apple is to tree as grape is to ___, and must fill in the word vine.

To solve "apple is to tree as grape is to _____": add (tree − apple) to grape to get vine. In the parallelogram model, the vector from the word apple to the word tree (= tree − apple) is added to the vector for grape; the nearest word to that point is returned.

[Figure 6.15: the parallelogram model: the apple → tree offset, applied to grape, lands near vine]
Analogical relations via parallelogram

The parallelogram method received more modern attention because of its success with word2vec or GloVe vectors (Mikolov et al. 2013b, Levy and Goldberg 2014, Pennington et al. 2014). It can solve analogies with both sparse and dense embeddings (Turney and Littman 2005, Mikolov et al. 2013b):

king − man + woman is close to queen
Paris − France + Italy is close to Rome

The embedding model thus seems to be extracting representations of relations like MALE-FEMALE, or CAPITAL-CITY-OF, or COMPARATIVE/SUPERLATIVE, as shown in Fig. 6.16 from GloVe.

For a problem a:b::a*:b*, the algorithm is given a, b, and a*, and the parallelogram method returns:

b̂* = argmin_x distance(x, a* − a + b)
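A sketch (my own) of the parallelogram method over a small dictionary of embeddings, using cosine as the nearness measure and excluding the three input words. The vectors below are random placeholders just to exercise the code; with real word2vec or GloVe embeddings, apple:tree::grape:? would typically return vine.

```python
import numpy as np

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a, a_star, b, embeddings):
    """Parallelogram method: return the word whose vector is closest
    (by cosine) to a* - a + b, excluding the three input words."""
    target = embeddings[a_star] - embeddings[a] + embeddings[b]
    candidates = (w for w in embeddings if w not in {a, a_star, b})
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Tiny made-up vectors; real embeddings are learned, so this output is arbitrary.
rng = np.random.default_rng(2)
embeddings = {w: rng.normal(size=50) for w in ["apple", "tree", "grape", "vine", "king"]}
print(analogy("apple", "tree", "grape", embeddings))
```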
Structure in GloVe embedding space
[Figure 6.16: relational structure in the GloVe embedding space]
Caveats with the parallelogram method
It only seems to work for frequent words, small
distances and certain relations (relating countries to
capitals, or parts of speech), but not others. (Linzen
2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)

Understanding analogy is an open area of research


(Peterson et al. 2020)
Embeddings as a window onto historical semantics
Train embeddings on different decades of historical text to see meanings shift
~30 million books, 1850-1990, Google Books data

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal
Statistical Laws of Semantic Change. Proceedings of ACL.
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer
programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.

Ask “Paris : France :: Tokyo : x”


◦ x = Japan
Ask “father : doctor :: mother : x”
◦ x = nurse
Ask “man : computer programmer :: woman : x”
◦ x = homemaker
Algorithms that use embeddings as part of e.g., hiring searches for
programmers, might lead to bias in hiring
Historical embedding as a tool to study cultural biases
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes.
Proceedings of the National Academy of Sciences 115(16), E3635–E3644.

• Compute a gender or ethnic bias for each adjective: e.g., how


much closer the adjective is to "woman" synonyms than
"man" synonyms, or names of particular ethnicities
• Embeddings for competence adjective (smart, wise,
brilliant, resourceful, thoughtful, logical) are biased toward
men, a bias slowly decreasing 1960-1990
• Embeddings for dehumanizing adjectives (barbaric,
monstrous, bizarre) were biased toward Asians in the
1930s, bias decreasing over the 20th century.
• These match the results of old surveys done in the 1930s
Properties of Embeddings
Vector
Semantics &
Embeddings
