2 Vector Semantics

David Packard, A Concordance to Livy (1968)

Natural Language Processing


Info 159/259
Lecture 2: Vector semantics and static word embeddings
(Jan 20, 2022)

David Bamman, UC Berkeley


“TOM!” No answer. “TOM!” No answer. “What's gone with that boy, I wonder?
You TOM!” No answer. The old lady pulled her spectacles down and looked
over them about the room; then she put them up and looked out under them.
She seldom or never looked through them for so small a thing as a boy; they
were her state pair, the pride of her heart, and were built for “style,” not
service--she could have seen through a pair of stove-lids just as well. She
looked perplexed for a moment, and then said, not fiercely, but still loud
enough for the furniture to hear: “Well, I lay if I get hold of you I'll--” She did not
finish, for by this time she was bending down and punching under the bed
with the broom, and so she needed breath to punctuate the punches with. She
resurrected nothing but the cat. “I never did see the beat of that boy!” She
went to the open door and stood in it and looked out among the tomato vines
and “jimpson” weeds that constituted the garden. No Tom. So she lifted up her
voice at an angle calculated for distance and shouted: “Y-o-u-u TOM!” There
was a slight noise behind her and she turned just in time to seize a small boy
by the slack of his roundabout and arrest his flight. “There! I might 'a' thought
of that closet. What you been doing in there?” “Nothing.” “Nothing! Look at
Lexical semantics

“You shall know a word by the company it keeps”


[Firth 1957]

Zellig Harris, “Distributional Structure” (1954)
Ludwig Wittgenstein, Philosophical Investigations (1953)
everyone likes ______________

a bottle of ______________ is on the table

______________ makes you drunk

a cocktail with ______________ and seltzer


The surrounding words form the context that constrains which words can fill the blank.

Distributed representation

• A vector representation that encodes information about the distribution of contexts a word appears in.

• Words that appear in similar contexts have similar representations (and similar meanings, by the distributional hypothesis).

• We have several different ways we can encode the notion of “context.”


Term-document matrix
             Hamlet  Macbeth  Romeo &  Richard  Julius  Tempest  Othello  King
                              Juliet   III      Caesar                    Lear
knife           1       1        4        2       0       2        0       10
dog             0       0        0        6      12       2        0        0
sword           2       2        7        5       0       5        0       17
love           64       0      135       63       0      12        0       48
like           75      38       34       36      34      41       27       44

Context = appearing in the same document.


Vectors

knife = [1, 1, 4, 2, 0, 2, 0, 10]
sword = [2, 2, 7, 5, 0, 5, 0, 17]

Vector representation of the term; vector size = number of documents.
Cosine Similarity

$$\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\;\sqrt{\sum_i y_i^2}}$$

• We can calculate the cosine similarity of two vectors to judge the degree of their similarity [Salton 1971].

• Euclidean distance measures the magnitude of the distance between two points.

• Cosine similarity measures their orientation.
             Hamlet  Macbeth  R&J   R3   JC   Tempest  Othello  KL
knife           1       1       4    2    0      2        0     10
dog             0       0       0    6   12      2        0      0
sword           2       2       7    5    0      5        0     17
love           64       0     135   63    0     12        0     48
like           75      38      34   36   34     41       27     44

cos(knife, knife) = 1
cos(knife, dog)   = 0.11
cos(knife, sword) = 0.99
cos(knife, love)  = 0.65
cos(knife, like)  = 0.61
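As a quick sanity check, here is a minimal Python sketch (not from the slides) that reproduces two of the cosine values above from the term-document counts, with empty cells treated as zeros:

```python
import numpy as np

# Term-document counts from the table above (blank cells treated as 0).
knife = np.array([1, 1, 4, 2, 0, 2, 0, 10])
sword = np.array([2, 2, 7, 5, 0, 5, 0, 17])
like = np.array([75, 38, 34, 36, 34, 41, 27, 44])

def cosine(x, y):
    # cos(x, y) = (x . y) / (|x| |y|)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(round(cosine(knife, sword), 2))  # 0.99
print(round(cosine(knife, like), 2))   # 0.61
```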


Weighting dimensions

• Not all dimensions are equally informative.

TF-IDF

• Term frequency-inverse document frequency.

• A scaling that represents a feature as a function of how frequently it appears in a data point, while accounting for its frequency in the overall collection.

• IDF for a given term = the number of documents in the collection / the number of documents that contain the term.
TF-IDF

• Term frequency (tf_{t,d}) = the number of times term t occurs in document d; several variants exist (e.g., passing the count through a log function).

• Inverse document frequency = the inverse of the fraction of documents that contain the term (D_t) among the total number of documents N.

$$\text{tf-idf}(t, d) = \text{tf}_{t,d} \times \log \frac{N}{D_t}$$
IDF

             Hamlet  Macbeth  Romeo &  Richard  Julius  Tempest  Othello  King   IDF
                              Juliet   III      Caesar                    Lear
knife           1       1        4        2       0       2        0        2    0.12
dog             2       0        6        6       0       2        0       12    0.20
sword          17       2        7       12       0       2        0       17    0.12
love           64       0      135       63       0      12        0       48    0.20
like           75      38       34       36      34      41       27       44    0

IDF weights terms by their informativeness when comparing document vectors.
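A small sketch of how the IDF column can be reproduced; it assumes the logarithm is base 10, which matches the values shown above:

```python
import math

N = 8  # total number of documents (plays)
# Number of plays each term appears in, read off the table above.
docs_containing = {"knife": 6, "dog": 5, "sword": 6, "love": 5, "like": 8}

for term, Dt in docs_containing.items():
    idf = math.log10(N / Dt)    # base-10 log reproduces the IDF column
    print(term, round(idf, 2))  # knife 0.12, dog 0.2, sword 0.12, love 0.2, like 0.0
```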
PMI

• Mutual information provides a measure of how independent two variables (X and Y) are.

• Pointwise mutual information measures the independence of two outcomes (x and y).
PMI

$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

For a word w and a context c:

$$\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}$$

What’s this value for a w and c that never occur together?

$$\text{PPMI}(w, c) = \max\left(\log_2 \frac{P(w, c)}{P(w)\,P(c)},\; 0\right)$$
             Hamlet  Macbeth  Romeo &  Richard  Julius  Tempest  Othello  King   total
                              Juliet   III      Caesar                    Lear
knife           1       1        4        2       0       2        0        2      12
dog             2       0        6        6       0       2        0       12      28
sword          17       2        7       12       0       2        0       17      57
love           64       0      135       63       0      12        0       48     322
like           75      38       34       36      34      41       27       44     329
total         159      41      186      119      34      59       27      123     748

$$\text{PMI}(\textit{love}, \textit{R\&J}) = \log_2 \frac{135/748}{(186/748)\,(322/748)} \approx 0.75$$
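A short sketch (not from the slides) that carries out this computation from the counts in the table:

```python
import math

count_love_rj = 135   # love in Romeo & Juliet
count_rj = 186        # Romeo & Juliet column total
count_love = 322      # love row total
total = 748           # grand total

p_wc = count_love_rj / total   # P(w, c)
p_w = count_love / total       # P(w)
p_c = count_rj / total         # P(c)

pmi = math.log2(p_wc / (p_w * p_c))
ppmi = max(pmi, 0.0)
print(round(pmi, 2))           # 0.75
```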
Term-context matrix

• Rows and columns are both words; cell counts = the number of times words w_i and w_j show up in the same context (e.g., a window of 2 tokens).
Dataset

• the big dog ate dinner

• the small cat ate dinner

• the white dog ran down the street

• the yellow cat ran inside

DOG context terms (window = 2): the, big, ate, dinner, the, white, ran, down

CAT context terms (window = 2): the, small, ate, dinner, the, yellow, ran, inside
Term-context matrix

                    the   big   ate   dinner   …
term      dog        2     1     1      1      …
          cat        2     0     1      1      …

• Each cell enumerates the number of times a context word appeared in a window of 2 words around the term.

• How big is each representation for a word here?
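Below is a minimal sketch of building these window-based counts from the toy dataset above; the variable names are illustrative rather than taken from the slides:

```python
from collections import defaultdict

sentences = [
    "the big dog ate dinner",
    "the small cat ate dinner",
    "the white dog ran down the street",
    "the yellow cat ran inside",
]

window = 2
counts = defaultdict(lambda: defaultdict(int))  # counts[term][context word]

for sentence in sentences:
    tokens = sentence.split()
    for i, term in enumerate(tokens):
        # count every context word within `window` tokens of the term
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[term][tokens[j]] += 1

print(counts["dog"]["the"])   # 2
print(counts["cat"]["big"])   # 0
print(counts["dog"]["ate"])   # 1
```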


We can also define “context” to be directional ngrams (i.e., ngrams of
a defined order occurring to the left or right of the term)

Dataset

• the big dog ate dinner

• the small cat ate dinner

• the white dog ran down the street

• the yellow cat ran inside

DOG context ngrams (window = 2): L: the big, R: ate dinner; L: the white, R: ran down

CAT context ngrams (window = 2): L: the small, R: ate dinner; L: the yellow, R: ran inside
Term-context matrix

                  L: the big   R: ate dinner   L: the small   L: the yellow   …
term      dog          1             1               0               0        …
          cat          0             1               1               1        …

• Each cell enumerates the number of times a directional context phrase appeared in a specific position around the term.
write a book
write a poem

• First-order co-occurrence (syntagmatic association): write co-occurs with book in the same sentence.

• Second-order co-occurrence (paradigmatic association): book co-occurs with poem (since each co-occurs with write).
Syntactic context

Lin 1998; Levy and Goldberg 2014


Evaluation

Intrinsic Evaluation

• Relatedness: correlation (Spearman/Pearson) between the vector similarity of a pair of words and human judgments.

word 1       word 2        human score
midday       noon          9.29
journey      voyage        9.29
car          automobile    8.94
…            …             …
professor    cucumber      0.31
king         cabbage       0.23

WordSim-353 (Finkelstein et al. 2002)
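For the correlation itself, a sketch using scipy; the model similarities here are made-up numbers, included only to show the call:

```python
from scipy.stats import spearmanr

# Human relatedness judgments (WordSim-353 style) and hypothetical model
# cosine similarities for the same word pairs.
human_scores = [9.29, 9.29, 8.94, 0.31, 0.23]   # midday/noon, journey/voyage, ...
model_scores = [0.81, 0.77, 0.83, 0.10, 0.05]   # made-up cosine similarities

rho, p_value = spearmanr(human_scores, model_scores)
print(rho)
```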


Intrinsic Evaluation

• Analogical reasoning (Mikolov et al. 2013). For the analogy Germany : Berlin :: France : ???, find the closest vector to v(“Berlin”) - v(“Germany”) + v(“France”).

Example analogies (last column = target):

possibly : impossibly :: certain : uncertain
generating : generated :: shrinking : shrank
think : thinking :: look : looking
Baltimore : Maryland :: Oakland : California
shrinking : shrank :: slowing : slowed
Rabat : Morocco :: Astana : Kazakhstan
Sparse vectors

“aardvark” as a one-hot vector over the vocabulary:

A            0
a            0
aa           0
aal          0
aalii        0
aam          0
Aani         0
aardvark     1
aardwolf     0
...          0
zymotoxic    0
zymurgy      0
Zyrenian     0
Zyrian       0
Zyryan       0
zythem       0
Zythia       0
zythum       0
Zyzomys      0
Zyzzogeton   0

A V-dimensional vector, with a single 1 marking the identity of the element.
Dense vectors

[Figure: the sparse one-hot vector is mapped to a dense, low-dimensional vector, e.g. [0.7, 1.3, -4.5].]
Singular value decomposition

• Any n⨉p matrix X can be decomposed into the product of three matrices (where m = the number of linearly independent rows).

[Figure: an n⨉m matrix ⨉ an m⨉m diagonal matrix (diagonal entries 9, 4, 3, 1, 2, 7, 9, 8, 1) ⨉ an m⨉p matrix]
Singular value decomposition

• We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix (the k largest singular values).

[Figure: the same decomposition, with all but the first k = 2 diagonal entries zeroed out (9, 4, 0, 0, …)]


             Hamlet  Macbeth  Romeo &  Richard  Julius  Tempest  Othello  King
                              Juliet   III      Caesar                    Lear
knife           1       1        4        2       0       2        0        2
dog             2       0        6        6       0       2        0       12
sword          17       2        7       12       0       2        0       17
love           64       0      135       63       0      12        0       48
like           75      38       34       36      34      41       27       44

[Figure: applying SVD to this matrix yields a low-dimensional representation for each term (here 2-dimensional) and a low-dimensional representation for each document (here 2-dimensional).]
Latent semantic analysis

• Latent Semantic Analysis/Indexing (Deerwester et al. 1990) is this process of applying SVD to the term-document co-occurrence matrix.

• Terms are typically weighted by tf-idf.

• This is a form of dimensionality reduction (for terms, from a D-dimensional sparse vector to a K-dimensional dense one, with K << D).
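A minimal numpy sketch of this pipeline, applying SVD to the term-document counts above and keeping the k = 2 largest singular values; the tf-idf weighting step is omitted for brevity, and folding the singular values into both factors is one common convention among several:

```python
import numpy as np

# Term-document counts from the matrix above; rows: knife, dog, sword, love, like.
X = np.array([
    [ 1,  1,   4,  2,  0,  2,  0,  2],
    [ 2,  0,   6,  6,  0,  2,  0, 12],
    [17,  2,   7, 12,  0,  2,  0, 17],
    [64,  0, 135, 63,  0, 12,  0, 48],
    [75, 38,  34, 36, 34, 41, 27, 44],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                 # keep the 2 largest singular values
term_vectors = U[:, :k] * s[:k]       # low-dimensional representation of terms
doc_vectors = Vt[:k, :].T * s[:k]     # low-dimensional representation of documents
print(term_vectors.shape, doc_vectors.shape)   # (5, 2) (8, 2)
```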
Dense vectors from prediction

• Learning low-dimensional representations of words by framing a prediction task: using context to predict words in a surrounding window.

• Transform this into a supervised prediction problem; similar to language modeling, but we’re ignoring order within the context window.
Dense vectors from prediction

• Word2vec skipgram model (Mikolov et al. 2013): given a single word in a sentence, predict the words in a context window around it.

Example: “a cocktail with gin and seltzer”, window size = 3

x      y
gin    a
gin    cocktail
gin    with
gin    and
gin    seltzer
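A sketch of generating these (x, y) pairs; real word2vec training also samples the effective window size and uses techniques like negative sampling, which are omitted here:

```python
def skipgram_pairs(tokens, window=3):
    """Generate (target word, context word) training pairs, ignoring order."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "a cocktail with gin and seltzer".split()
print([pair for pair in skipgram_pairs(sentence) if pair[0] == "gin"])
# [('gin', 'a'), ('gin', 'cocktail'), ('gin', 'with'), ('gin', 'and'), ('gin', 'seltzer')]
```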
Dimensionality reduction

One-hot vector for “the” (V-dimensional):   the 1, a 0, an 0, for 0, in 0, on 0, dog 0, cat 0, …
Dense vector for “the” (2-dimensional):     [4.1, -0.9]

“the” is a point in V-dimensional space; “the” is a point in 2-dimensional space.


[Figure: a network with a one-hot input layer x (gin, cocktail, globe), a 2-dimensional hidden layer h, and an output layer y (gin, cocktail, globe), connected by weight matrices W and V.]

x (one-hot for “cocktail”):   gin 0,  cocktail 1,  globe 0

W (3⨉2):
gin        -0.5   1.3
cocktail    0.4   0.08
globe       1.7   3.1

V (2⨉3):
 4.1   0.7   0.1
-0.9   1.3   0.3

y (target, one-hot for “gin”):   gin 1,  cocktail 0,  globe 0


[Figure: the same network. Only one of the inputs is nonzero, so the hidden layer x⨉W is really W_cocktail, the row of W for the input word: [0.4, 0.08].]
x ⨉ W

[Figure: a one-hot vector x (with its single 1 in position 12) multiplied by a 20⨉2 matrix W; the product x⨉W = [-1.01, -2.52], which is exactly row 12 of W.]

This is the embedding of the context.
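A tiny numpy check of this point, using the small gin/cocktail/globe W matrix from the figure above: multiplying a one-hot vector by W simply selects the corresponding row of W.

```python
import numpy as np

W = np.array([[-0.5, 1.3 ],   # gin
              [ 0.4, 0.08],   # cocktail
              [ 1.7, 3.1 ]])  # globe

x = np.array([0.0, 1.0, 0.0])  # one-hot input for "cocktail"

# Multiplying a one-hot vector by W just picks out the matching row of W,
# so the hidden layer is simply the embedding of the input word.
print(x @ W)   # [0.4  0.08]
print(W[1])    # the same thing: W_cocktail
```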
Word embeddings

• Can you predict the output word from a vector representation of the input word?

• Rather than seeing the input as a one-hot encoded vector specifying the word in the vocabulary we’re conditioning on, we can see it as indexing into the appropriate row in the weight matrix W.
Word embeddings

• Similarly, V has one H-dimensional vector for each element in the vocabulary (for the words that are being predicted).

V:      gin   cocktail   cat   globe
        4.1     0.7      0.1    1.3
       -0.9     1.3      0.3   -3.4

This is the embedding of the word.


               1          2           3          4         …    50
the            0.418      0.24968    -0.41242    0.1217    …   -0.17862
,              0.013441   0.23682    -0.16899    0.40951   …   -0.55641
.              0.15164    0.30177    -0.16763    0.17684   …   -0.31086
of             0.70853    0.57088    -0.4716     0.18048   …   -0.52393
to             0.68047   -0.039263    0.30186   -0.17792   …    0.13228
…              …          …           …          …         …    …
chanty         0.23204    0.025672   -0.70699   -0.04547   …    0.34108
kronik        -0.60921   -0.67218     0.23521   -0.11195   …    0.85632
rolonda       -0.51181    0.058706    1.0913    -0.55163   …    0.079711
zsombor       -0.75898   -0.47426     0.4737     0.7725    …    0.84014
sandberger     0.072617  -0.51393     0.4728    -0.52202   …    0.23096

https://nlp.stanford.edu/projects/glove/
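A minimal sketch of reading vectors in this format into a dictionary (the filename below refers to the 50-dimensional file distributed on the GloVe page):

```python
import numpy as np

def load_glove(path):
    """Read GloVe vectors from a text file with lines of the form: word dim1 ... dimN."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

glove = load_glove("glove.6B.50d.txt")
print(glove["the"][:4])   # first four dimensions of the vector for "the"
```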
[Figure: a 2-dimensional plot of word embeddings in which dog, cat, and puppy appear near one another, and wrench and screwdriver appear near one another.]
• Why this behavior? dog, cat show up in similar positions

the black cat jumped on the table

the black dog jumped on the table

the black puppy jumped on the table

the black skunk jumped on the table

the black shoe jumped on the table


• Why this behavior? dog, cat show up in similar positions

the black [0.4, 0.08] jumped on the table

the black [0.4, 0.07] jumped on the table

the black puppy jumped on the table

the black skunk jumped on the table

the black shoe jumped on the table

To make the same predictions, these numbers need to be close to each other.
“Word embedding” in NLP papers

[Chart: the proportion of ACL papers per year mentioning “word embedding”, 2001-2019 (y-axis from 0 to 0.7).]

Data from ACL papers in the ACL Anthology (https://www.aclweb.org/anthology/)


Analogical inference

• Mikolov et al. 2013 show that vector representations have some potential for
analogical reasoning through vector arithmetic.

apple - apples ≈ car - cars


king - man + woman ≈ queen

Mikolov et al., (2013), “Linguistic Regularities in Continuous Space Word Representations” (NAACL)
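A brute-force sketch of this kind of analogy query; it assumes an `embeddings` dictionary mapping words to numpy arrays (for example, the GloVe vectors loaded earlier) and excludes the query words themselves from the candidates:

```python
import numpy as np

def most_similar(query, embeddings, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `query`."""
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

def analogy(a, b, c, embeddings):
    # a : b :: c : ?, answered with the closest vector to v(b) - v(a) + v(c)
    query = embeddings[b] - embeddings[a] + embeddings[c]
    return most_similar(query, embeddings, exclude={a, b, c})

# analogy("man", "king", "woman", glove)   # ideally close to "queen"
```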
Bias

• Allocational harms: automated systems allocate resources unfairly to different groups (access to housing, credit, parole).

• Representational harms: automated systems represent one group less favorably than another (including demeaning them or erasing their existence).

Blodgett et al. (2020), “Language (Technology) is Power: A Critical Survey of “Bias” in NLP”
Representations

• Embeddings for African-American first names are closer to “unpleasant” words than embeddings for European-American names (Caliskan et al. 2017).

• Sentiment analysis scores for sentences containing African-American first names are more negative than for identical sentences with European-American names.

Kiritchenko and Mohammad (2018), "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems"
Interrogating “bias”
• Kozlowski et al. (2019), “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings,” American Sociological Review.

• An et al. (2018), “SemAxis: A Lightweight Framework to Characterize Domain-Specific Word Semantics Beyond Sentiment.”
Low-dimensional distributed
representations

• Low-dimensional, dense word representations are extraordinarily powerful (and are arguably responsible for much of the gains that neural network models have brought to NLP).

• They let your representation of the input share statistical strength with words that behave similarly in terms of their distributional properties (often synonyms or words that belong to the same class).
Two kinds of “training” data

• The labeled data for a specific task (e.g., labeled sentiment for movie
reviews): ~ 2K labels/reviews, ~1.5M words → used to train a supervised
model

• General text (Wikipedia, the web, books, etc.), ~ trillions of words → used to
train word distributed representations
Using dense vectors

• In neural models (CNNs, RNNs, LMs), replace the V-dimensional sparse vector with the much smaller K-dimensional dense one.

• Can also take the derivative of the loss function with respect to those representations to optimize them for a particular task.
emoji2vec

Eisner et al. (2016), “emoji2vec: Learning Emoji Representations from their Description”
node2vec

Grover and Leskovec (2016), “node2vec: Scalable Feature Learning for Networks”
Trained embeddings

• Word2vec
https://code.google.com/archive/p/word2vec/

• GloVe
https://nlp.stanford.edu/projects/glove/
HW1 out today

• Due Wed Jan 26 @ 11:59pm
