Natural Language

Processing
Lecture 12: Lexical Semantics (part I) -
Word Representations and Word Embeddings.

11/30/2020

COMS W4705
Yassine Benajiba
Jabberwocky
• Can you identify what the words in this poem mean?

Beware the jabberwock, my son
the jaws that bite, the claws that catch!
Beware the jubjub bird, and
the frumious bandersnatch!
"Jabberwocky", Lewis Carroll, 1871
Semantic Similarity and
Relatedness
• We can often tell that two words are similar or related, even if
they aren't exact synonyms.

• "fast" is similar to "rapid" and "speed"

• "tall" is similar to "high" and "height"

• Question answering:

• Q: "How tall is Mt. Everest?"

• Candidate A: "The official height of Mount Everest is 29029 feet"
Relatedness
• "cat" is more similar to "dog" than to "table"

• "table" is more similar to "chair" than to "dog"

• "run" is more similar to "fly" than to "think".

• "cat" is more similar to "meow" than to "bark".


Single Word Representation:
One-Hot Vector

[Figure: a one-hot vector of dimension |V|. The entry for the represented word (here "fish") is 1; every other entry, for the rest of the vocabulary from "a" to "zythum", is 0.]
What about unseen words?
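A minimal sketch of a one-hot representation over a toy vocabulary (the vocabulary and helper names are illustrative, not from the slides); note how an unseen word has no index at all:

```python
import numpy as np

# Toy vocabulary; in practice |V| is the full corpus vocabulary.
vocab = ["a", "fish", "zythum"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))       # |V|-dimensional vector of zeros
    v[word_to_idx[word]] = 1.0     # single 1 at the word's index
    return v

print(one_hot("fish"))             # [0. 1. 0.]
# one_hot("tesgüino") raises KeyError: unseen words have no representation.
```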


Unknown Words
A bottle of tesgüino is on the table.
Everybody likes tesgüino.
Tesgüino makes you drunk.
We make tesgüino out of corn.
Example from Nida, 1975.
• Can you figure out from context what tesgüino means?

• Some kind of alcoholic beverage, maybe beer or whisky.

• Intuition: Two words should be similar if they have similar typical word contexts.
How would you represent context?
Distributional Hypothesis
• Wittgenstein ("Philosophical Investigations"):
"the meaning of a word is in its use in the language"

• Zellig Harris (1954):
"oculist and eye-doctor … occur in almost the same environments"
"If A and B have almost identical environments we say that they are synonyms."

• J.R. Firth (1957):
"you shall know a word by the company it keeps!"
Co-occurrence Matrix

        ⌼    ⊞    ⊛    ⋔    ⏈    ⍾
⌘      51   20   84    0    3    0
⌓      52   58    4    4    6   26
⊠     115   89   10   42   33   17
⊚      59   39   23    4    0    0
⁙      98   14    6    2    1    0
⁂      12   17    3    2    9   27
⎔      11    2    2    0   18    0

sim(⊠, ⌘) = 0.770
sim(⊠, ⁂) = 0.939
sim(⊠, ⌓) = 0.961

• Numbers are co-occurrence counts (how often the symbols appear together in context).

• Which symbol is most similar to ⊠?
What it really looks like
         get   see   use   hear   eat   kill
knife     51    20    84      0     3      0
cat       52    58     4      4     6     26
dog      115    89    10     42    33     17
boat      59    39    23      4     0      0
cup       98    14     6      2     1      0
pig       12    17     3      2     9     27
berry     11     2     2      0    18      0

sim(dog, knife) = 0.770
sim(dog, boat)  = 0.939
sim(dog, cat)   = 0.961

Verb-object counts

• The row vector x_dog describes the usage of dog as a grammatical object in the corpus.

• It can be seen as coordinates in n-dimensional Euclidean space.
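To make the row-vector view concrete, here is a small sketch that stores the verb-object counts above as a matrix and pulls out x_dog (variable names such as M and nouns are my own, not from the slides):

```python
import numpy as np

# Verb-object co-occurrence counts from the table above.
verbs = ["get", "see", "use", "hear", "eat", "kill"]
nouns = ["knife", "cat", "dog", "boat", "cup", "pig", "berry"]
M = np.array([
    [ 51, 20, 84,  0,  3,  0],   # knife
    [ 52, 58,  4,  4,  6, 26],   # cat
    [115, 89, 10, 42, 33, 17],   # dog
    [ 59, 39, 23,  4,  0,  0],   # boat
    [ 98, 14,  6,  2,  1,  0],   # cup
    [ 12, 17,  3,  2,  9, 27],   # pig
    [ 11,  2,  2,  0, 18,  0],   # berry
])

x_dog = M[nouns.index("dog")]    # coordinates of "dog" in 6-dimensional space
print(x_dog)                     # [115  89  10  42  33  17]
```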
Geometric Interpretation
• The row vector x_dog describes the usage of dog in the corpus.
• It can be seen as coordinates in n-dimensional Euclidean space.
• Illustrated for the two dimensions "get" and "use": x_dog = (115, 10).


Geometric Interpretation
• How should we compute similarity?
• First approach: spatial distance between words (lower distance = higher similarity).
• Potential problem: the location depends on the frequency of the noun; here count(dog) ≈ 2.7 · count(cat).
Geometric Interpretation
• How should we compute similarity?
• Second approach: direction is more important than location.
• Normalize the "length" ||x_dog|| of the vector,
• or use the angle α as a distance measure (or the cosine of the angle).

[Figure annotation: α = 54.3°, sim(dog, knife) = 0.58]
Cosine Similarity

cos(α) = (x · y) / (||x|| · ||y||)

• Collinear vectors (same direction): cos(α) = 1.

• Orthogonal vectors (90° angle, no shared attributes): cos(α) = 0.
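A minimal sketch of cosine similarity on the verb-object row vectors from the table above. The exact similarity values printed on the slides were presumably computed from a larger matrix, but the ranking (cat > boat > knife) comes out the same on these six columns:

```python
import numpy as np

def cos_sim(x, y):
    """Cosine of the angle between vectors x and y."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Row vectors taken from the verb-object count table above.
x_dog   = np.array([115, 89, 10, 42, 33, 17])
x_cat   = np.array([ 52, 58,  4,  4,  6, 26])
x_boat  = np.array([ 59, 39, 23,  4,  0,  0])
x_knife = np.array([ 51, 20, 84,  0,  3,  0])

print(round(cos_sim(x_dog, x_cat), 3))    # highest of the three
print(round(cos_sim(x_dog, x_boat), 3))
print(round(cos_sim(x_dog, x_knife), 3))  # lowest of the three
```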
What to do with DSM
similarities

• Most similar to school:

country (49.3), church (52.1), hospital (53.1), house (54.4), hotel (55.1), industry (57.0), company (57.0), home (57.7), family (58.4), university (59.0), party (59.4), group (59.5), building (59.8), market (60.3), bank (60.4), business (60.9), area (61.4), department (61.6), club (62.7), town (63.3), library (63.3), room (63.6), service (64.4), police (64.7), ...
Clustering and Semantic
Maps
• Distributional similarity/distance can be used to

• find nearest neighbors (similar words),

• cluster related words into hierarchical categories, and

• construct semantic maps (a minimal clustering sketch follows below).
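As a rough illustration, reusing the toy verb-object matrix from earlier (library choices and parameters are my own, not from the slides), nearest neighbors and a hierarchical clustering can be computed directly from cosine distances:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

nouns = ["knife", "cat", "dog", "boat", "cup", "pig", "berry"]
M = np.array([[ 51, 20, 84,  0,  3,  0], [ 52, 58,  4,  4,  6, 26],
              [115, 89, 10, 42, 33, 17], [ 59, 39, 23,  4,  0,  0],
              [ 98, 14,  6,  2,  1,  0], [ 12, 17,  3,  2,  9, 27],
              [ 11,  2,  2,  0, 18,  0]])

dist = pdist(M, metric="cosine")              # pairwise cosine distances
D = squareform(dist)                          # full distance matrix

# Nearest neighbor of "dog" = smallest non-self distance in its row.
i = nouns.index("dog")
print(nouns[np.argsort(D[i])[1]])             # -> "cat"

# Hierarchical clustering / dendrogram as a simple "semantic map".
dendrogram(linkage(dist, method="average"), labels=nouns)
plt.show()
```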


Variations of Distributional
Semantic Models
• A Distributional Semantic Model (DSM) is any matrix M in which each row represents the distribution of a term x across contexts, together with a similarity measure.

• The previous example shows one particular semantic space (frequency counts of verb-object co-occurrences).

• There are many different models we could choose.

• Different models might capture different "types" of similarity.


Dimensions of Distributional
Semantic Models
1. Preprocessing, definition of "terms" (word form, lemmas, POS, ...).

2. Context definition:

• Type of context (words, syntactic dependents (with or without relation labels), removal of stop-words, etc.)

• Size of context window.

3. Feature scaling / term weighting (association measures).

4. Normalization of rows / columns.

5. Dimensionality reduction.

6. Similarity measure.
Effect of context size
Nearest neighbors of dog

2-word window:
cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon

30-word window:
kennel, puppy, pet, terrier, rottweiler, canine, cat, to bark, Alsatian
Term Weighting
• Problem: Not all context terms are equally relevant to
characterize the meaning of a word.

• Some appear too often, some are too rare (Zipfian distribution).

[Figure: a frequency scale with examples — too frequent: function words such as "the", "a", "can", "may"; just right: "general", "eat", "explosion", "nations"; too rare: "antiproliferative", "87-year-old", "Unified".]

• One solution: TF*IDF (term frequency * inverse document frequency).
TF*IDF
• Originates in document retrieval (find documents relevant to a keyword). For a DSM: 'document' = target word d.

• Term frequency tf(t, d): how often does the term t appear in the context window of the target word d?

• Inverse document frequency idf(t): for how many target words does t appear in the context window? The fewer, the more informative t is.

• TF*IDF (standard formulation): tfidf(t, d) = tf(t, d) · log(N / df(t)), where N is the number of target words and df(t) is the number of target words in whose context window t appears.
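A minimal sketch of applying this weighting to the toy co-occurrence matrix from earlier (the variable names and the exact IDF variant, log(N / df), are assumptions on my part):

```python
import numpy as np

# Toy co-occurrence matrix: rows = target words, columns = context terms.
M = np.array([[ 51, 20, 84,  0,  3,  0], [ 52, 58,  4,  4,  6, 26],
              [115, 89, 10, 42, 33, 17], [ 59, 39, 23,  4,  0,  0],
              [ 98, 14,  6,  2,  1,  0], [ 12, 17,  3,  2,  9, 27],
              [ 11,  2,  2,  0, 18,  0]], dtype=float)

tf = M                                   # term frequency: raw co-occurrence counts
df = (M > 0).sum(axis=0)                 # for how many target words each context term occurs
idf = np.log(M.shape[0] / df)            # inverse document frequency
M_weighted = tf * idf                    # TF*IDF-weighted matrix (idf broadcast over rows)

# Context terms that occur with every target (here "get") receive weight 0.
print(M_weighted.round(2))
```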
Sparse vs. Dense Vectors
• The full co-occurrence matrix is very big and contains a lot of 0 entries.

• Potentially inconvenient to store. Slow computation.

• Near-synonyms may still end up with (nearly) orthogonal vectors if they happen to co-occur with different context words, i.e. along dimensions that are irrelevant to their meaning.

• Word embeddings are representations of words in a low-dimensional, dense vector space. There are two main approaches:

• Use matrix decomposition of the co-occurrence matrix, for example Singular Value Decomposition (SVD) (see the sketch after this list).

• Learn embeddings using neural networks. Minimal feature engineering required.
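A minimal sketch of the first approach: truncated SVD of a (possibly weighted) co-occurrence matrix. The choice k = 2 and the variable names are illustrative, not from the slides:

```python
import numpy as np

# Toy co-occurrence matrix (rows = words); in practice this is large and sparse.
M = np.array([[ 51, 20, 84,  0,  3,  0], [ 52, 58,  4,  4,  6, 26],
              [115, 89, 10, 42, 33, 17], [ 59, 39, 23,  4,  0,  0],
              [ 98, 14,  6,  2,  1,  0], [ 12, 17,  3,  2,  9, 27],
              [ 11,  2,  2,  0, 18,  0]], dtype=float)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2                                     # keep only the k largest singular values
embeddings = U[:, :k] * S[:k]             # one dense k-dimensional vector per word
print(embeddings.shape)                   # (7, 2)
```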
Learning Word Embeddings
with Neural Networks
• The neural network should capture the relationship between a
word and its context.

• Two models (Word2Vec, Mikolov et al. 2013):

• Skip-gram model: the input is a single word; predict a probability for each context word.

• Continuous bag-of-words (CBOW): the input is a representation of the context window; predict a probability for each possible target word.

• Inspired by neural language models (Bengio et al. 2003).

Skip-Gram Model
• Input:
A single word in one-hot representation.

• Output: probability to see any single word as a context word.

[Network diagram: the one-hot input for "eat" (|V| input neurons) feeds a hidden layer of d neurons, which feeds |V| output neurons with softmax activation; the outputs are context-word probabilities such as 0.02 for "a", 0.04 for "cheese", 0.03 for "place", 0.0 for "thought" and "run".]
• Softmax function normalizes the activation of the output neurons to sum up to 1.0.
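A minimal sketch of this forward pass (toy vocabulary, random weights, and dimension d = 4 are all illustrative assumptions; real Word2Vec additionally trains these weights):

```python
import numpy as np

vocab = ["a", "place", "to", "eat", "delicious", "cheese", "thought", "run"]
V, d = len(vocab), 4                       # vocabulary size, hidden-layer size

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))  # input -> hidden weights
W_out = rng.normal(scale=0.1, size=(d, V)) # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())                # subtract max for numerical stability
    return e / e.sum()

# The one-hot input just selects one row of W_in: that row is the hidden activation.
h = W_in[vocab.index("eat")]               # shape (d,)
p = softmax(h @ W_out)                     # probability of each word as a context word
print(dict(zip(vocab, p.round(3))))        # sums to 1.0 because of the softmax
```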
Skip-Gram Model
• Compute error with respect to each context word.
[Example: for the sentence "... a place to eat delicious cheese .", the target word w_t = "eat" and the context words w_t-c, ..., w_t-1, w_t+1, ..., w_t+c yield the training pairs (eat, place), (eat, to), (eat, delicious), (eat, cheese).]

• Combine the errors for all context words, then use the combined error to update the weights via back-propagation.
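A minimal sketch of how such (target, context) training pairs can be generated from a tokenized sentence (the function name and window size c are illustrative choices):

```python
def skipgram_pairs(tokens, c=2):
    """Generate (target, context) pairs within a +/- c word window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - c), min(len(tokens), i + c + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("a place to eat delicious cheese".split(), c=2)
print([p for p in pairs if p[0] == "eat"])
# [('eat', 'place'), ('eat', 'to'), ('eat', 'delicious'), ('eat', 'cheese')]
```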
Embeddings are Magic
(Mikolov 2016)

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
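This kind of analogy can be reproduced with, for example, gensim and the pre-trained Google News Word2Vec vectors (the file path below is an assumption about where you have downloaded them; see the word2vec link on the next slide):

```python
from gensim.models import KeyedVectors

# Load pre-trained vectors (path is illustrative).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ≈ queen
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```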


Application: Word Pair
Relationships
Using Word Embeddings
• Word2Vec:

• https://code.google.com/archive/p/word2vec/

• GloVe: Global Vectors for Word Representation

• https://nlp.stanford.edu/projects/glove/

• Can either use pre-trained word embeddings or train them on a large corpus.
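One convenient way to get pre-trained vectors without manual downloads is gensim's downloader module (the model name and library choice are assumptions on my part, not part of the lecture):

```python
import gensim.downloader as api

# Downloads (once) and loads 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("school", topn=5))   # nearest neighbors in embedding space
print(glove.similarity("dog", "cat"))         # cosine similarity between two words
```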
Acknowledgments

• Some content adapted from slides by Kathy McKeown, Dan Jurafsky, Stefan Evert, and Marco Baroni.
