Chapter 8: Word embedding
8.1 Lexical Semantics
• Common terms:
• Word form: the inflected form as it appears in text (e.g., mice)
• Lemma: the citation or dictionary form shared by a set of word forms (e.g., mouse for mice)
• Word sense: a discrete representation of one aspect of a word's meaning
Relations between word senses
• Similar words:
• Words with similar meanings: not synonyms, but sharing some element of meaning
• Word similarity
• Helps in computing how similar the meanings of two phrases or sentences are (see the cosine sketch below)
• Measured against human judgements such as the SimLex-999 dataset
• Word relatedness / word association
• Words associated by co-occurring in the same semantic field, without necessarily being similar
• Antonym
• “Similar” words that are opposites with respect to only one feature of meaning
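Word similarity is commonly quantified as the cosine between word vectors (the vector representations introduced in the next section). Below is a minimal sketch with made-up toy vectors; the words and values are illustrative assumptions, not data from the slides.

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine of the angle between vectors v and w (1.0 = identical direction)."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Toy 3-dimensional vectors (illustrative values only, not real embeddings).
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))  # high: similar meanings
print(cosine_similarity(cat, car))  # low: dissimilar meanings
```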
8.2 Word semantic vectors
Figure: a two-dimensional (t-SNE) projection of embeddings for some words and phrases, with negative words, neutral function words, and positive words forming separate clusters.
• => offers enormous power to NLP applications: models can work with similar meanings instead of exact word forms only
• Vector semantic models
• Learned automatically from text
• tf-idf model, word2vec model, etc. (a tf-idf sketch follows below)
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
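As a concrete illustration of a vector semantic model learned automatically from text, here is a minimal tf-idf sketch using scikit-learn's TfidfVectorizer (assuming scikit-learn >= 1.0; the toy corpus is an assumption for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only): each document is mapped to one tf-idf vector.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)       # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray().round(2))                # tf-idf weight of each word in each document
```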
8.3 Word embeddings
(Source: https://web.stanford.edu/class/cs224n/)
• Skip-gram models
• Naïve softmax (simple but expensive loss function when there are many output classes)
• More optimized variants like hierarchical softmax
• Negative sampling with logistic regression
• Skip-gram as logistic regression
• Instead of counting how often each word c occurs near a target word w, train a binary classifier on the prediction task: “Is word c likely to show up near the target word w?”
• Use logistic regression to train the classifier to distinguish the two cases (+/-).
• Self-supervision:
• only use running text as implicitly supervised training data
• a word c that occurs near the target word w acts as a ‘positive example’
• randomly sample other words in the lexicon to get negative examples (negative sampling)
• The learned weights are used as the embeddings (a sketch of the pair-generation step follows below)
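A minimal sketch of the self-supervised pair-generation step described above: slide a window over running text, emit (target, context, +) pairs, and draw k random words from the lexicon as negatives for each positive. This simplified version samples negatives uniformly, unlike word2vec's weighted unigram sampling; the toy sentence and lexicon are assumptions.

```python
import random

def make_training_pairs(tokens, lexicon, window=2, k=2, seed=0):
    """Generate (target, context, label) triples for the skip-gram classifier."""
    rng = random.Random(seed)
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.append((target, tokens[j], 1))            # positive: real context word
            for _ in range(k):                              # k negatives per positive
                pairs.append((target, rng.choice(lexicon), 0))
    return pairs

tokens = "a tablespoon of apricot jam".split()
lexicon = ["aardvark", "zebra", "jam", "of", "hello", "my"]  # toy lexicon (assumption)
for p in make_training_pairs(tokens, lexicon)[:6]:
    print(p)
```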
Skip-gram model as logistic regression
• The classifier
• Context: a window of ±n words around the target
• Near: a word c is likely to occur near the target w if its embedding vector is similar to the target embedding:
• Similarity(w,c) ≈ c·w
• Model the probability that word c is/is not a real context word for target word w:
$P(+ \mid w, c) = \sigma(\mathbf{c} \cdot \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{c} \cdot \mathbf{w})}; \qquad P(- \mid w, c) = 1 - P(+ \mid w, c) = \frac{1}{1 + \exp(\mathbf{c} \cdot \mathbf{w})}$
• Model the probability that a sequence c1..cL is a real context sequence for target word w:
$P(+ \mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(\mathbf{c}_i \cdot \mathbf{w})$
• Each word has two embeddings: one for the word as a target, and one for the word considered as context
• For all |V| words in the vocabulary, we therefore need to build two embedding matrices W and C (a sketch of the probability computation follows below)
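A minimal sketch of the classifier's probability computation with the two matrices W (target embeddings) and C (context embeddings); here they are randomly initialized for illustration, whereas in practice they are learned by minimizing the negative-sampling loss. Vocabulary size and dimension are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                      # toy vocabulary size and embedding dimension (assumptions)
W = rng.normal(size=(V, d))      # target embeddings, one row per vocabulary word
C = rng.normal(size=(V, d))      # context embeddings, one row per vocabulary word

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(w_id, c_id):
    """P(+|w,c) = sigma(c . w): probability that c is a real context word of w."""
    return sigmoid(C[c_id] @ W[w_id])

def p_positive_window(w_id, context_ids):
    """P(+|w, c_1:L) = product over i of sigma(c_i . w)."""
    return np.prod([p_positive(w_id, c) for c in context_ids])

print(p_positive(0, 1))                    # one candidate context word
print(p_positive_window(0, [1, 2, 4, 5]))  # a window of L = 4 context words
```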
Figure: skip-gram classifier example with a context window of L = ±2 words around the target.
Visualizing Embeddings - Semantic properties of embeddings
• Embeddings capture comparative and superlative morphology (a probe with pretrained vectors follows below)
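This regularity can be probed with vector arithmetic: the offset between a base form and its comparative is roughly constant across word pairs, so the vector closest to vec("bigger") - vec("big") + vec("small") should be near vec("smaller"). The sketch below uses gensim's downloadable pretrained vectors; the specific model name is an assumption, and any pretrained embedding set would do.

```python
# Probe comparative morphology with pretrained embeddings (requires gensim and a download).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors instance

# Analogy big : bigger :: small : ?  computed as the nearest neighbours of
# vec("bigger") - vec("big") + vec("small").
print(vectors.most_similar(positive=["bigger", "small"], negative=["big"], topn=3))
```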
Resources