
Natural Language Processing

AC3110E

Chapter 8: Word embedding

Lecturer: PhD. DO Thi Ngoc Diep


SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Introduction

• Vector semantics: representations of the (embedded) meaning of words

• learned automatically (representation learning) instead of created by hand
• important for meaning-related tasks (question answering, dialogue)
• Views of word meaning:
• a string of letters
• an index in a vocabulary list
• through word relationships: similar meanings, synonyms/antonyms, positive/negative
connotations, etc.
• How to represent the meaning (sense) of words?
• Lexical semantics
• Words and embedding
• Similarity measuring
• Word2vec
• Semantic properties of embeddings
• Evaluating vector models

8.1 Lexical Semantics

• Common terms:
• Word form, Lemma, Word sense
• e.g., mice and mouse are word forms of the lemma mouse
• Lemmas can be polysemous (have multiple senses)


• Relations between word senses
• Synonymy/synonym: meaning is identical or nearly identical
• water/H2O (but would "H2O" sound right in a surfing guide?)
• big/large (but my big sister != my large sister)

• Other relations between words => WordNet, a thesaurus containing
lists of synonym sets and hypernyms
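As a small illustration of looking up synonym sets and hypernyms in WordNet, here is a sketch using NLTK's WordNet interface (this assumes the nltk package is installed; the WordNet data is downloaded on first use and is not part of the original slides).

```python
import nltk
nltk.download("wordnet", quiet=True)      # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

# Synonym sets (synsets) for the adjective "big", with their member lemmas
for synset in wn.synsets("big", pos=wn.ADJ)[:3]:
    print(synset.name(), "->", synset.lemma_names())

# Hypernyms ("is-a" relations) for the first noun sense of "mouse"
mouse = wn.synsets("mouse", pos=wn.NOUN)[0]
print(mouse.hypernyms())                  # e.g. [Synset('rodent.n.01')]
```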

Relations between word senses

• Similar words:
• Words with similar meanings. Not synonyms, but sharing some element of meaning

• Word Similarity
• helps in computing how similar the meanings of two phrases
or sentences are
• human similarity ratings for word pairs: the SimLex-999 dataset
• Word Relatedness/word association

• Same semantic field: words that cover a particular semantic domain
(e.g., hospital: surgeon, scalpel, nurse)
• Semantic Frames and Roles:

• denote perspectives or semantic frame participants in a particular type of event:
sell/buy, seller/buyer

Relations between word senses

• Antonym
• “Similar” words that are opposites with respect to only one feature of meaning

• Connotations (affective meanings)


• related to a writer's or reader's emotions, sentiment, opinions, etc.
• Positive connotations: happy
• Negative connotations: sad

• Positive connotation: copy, replica, reproduction


• Negative connotation: fake, knockoff, forgery

• Positive evaluation: great, love


• Negative evaluation: terrible, hate

8.2 Word semantic vectors

• Standard way to represent word meaning in NLP

• Vectors to represent words
• Represent a word as a point in a multidimensional semantic space
• Show the distributions of word neighbors
• Can compute semantic similarity based on vector distance (see the cosine-similarity
sketch at the end of this slide)

[Figure: a two-dimensional (t-SNE) projection of embeddings for some words and phrases,
showing clusters of negative words, neutral function words, and positive words]
• => offers enormous power to NLP applications: models can generalize over similar
meanings instead of relying on exact word forms only
• Vector semantic models
• Learned automatically from text
• tf-idf model, word2vec model
• etc.
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition
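As referenced above, semantic similarity between two word vectors is typically measured with cosine similarity. The sketch below is a minimal illustration with made-up 4-dimensional vectors; real embeddings would be learned and have 50-300+ dimensions.

```python
import numpy as np

def cosine_similarity(v, w):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Toy 4-dimensional embeddings (made-up values, for illustration only)
embeddings = {
    "cherry":     np.array([0.90, 0.10, 0.05, 0.00]),
    "strawberry": np.array([0.85, 0.20, 0.10, 0.05]),
    "digital":    np.array([0.05, 0.90, 0.80, 0.70]),
}

print(cosine_similarity(embeddings["cherry"], embeddings["strawberry"]))  # high (similar words)
print(cosine_similarity(embeddings["cherry"], embeddings["digital"]))     # low (unrelated words)
```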
8.3 Word embeddings

• Previous traditional word models


• one-hot word vectors, word/term-document matrix, word-word matrix, tf-idf, etc.
• sparse, long vectors => problems when applying, training, and optimizing models
• How to create word representations with short, dense vectors?
• Distributional semantics: “A word’s meaning is given by the words that frequently
appear close-by”
• Based on the context of a word (the set of words that appear nearby, within a fixed-size
window) to discover the word's meaning (see the sketch after this slide)
• Use the many contexts of w to build up a representation of w
• vectors of words that appear in similar contexts will have high similarity scores
• Embeddings: word vectors, (word) embeddings, (neural) word representations, etc.

https://web.stanford.edu/class/cs224n/
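To make the fixed-size context window concrete, here is a small sketch (using a toy sentence not taken from the slides) that collects, for each target word, the counts of words appearing within a ±2 window; such co-occurrence counts are the raw material of distributional methods.

```python
from collections import defaultdict

def context_counts(tokens, window=2):
    """For each target word, count the words that appear within +/- window positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "the cat sat on the mat".split()
print(dict(context_counts(tokens)["sat"]))   # {'the': 2, 'cat': 1, 'on': 1}
```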
8.3 Word embeddings

• Methods for computing word vectors:


• “Neural Language Model”-inspired models
• Word2vec (skipgram, CBOW)
• GloVe
• FastText
• Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA)
• Dynamic contextual embeddings
• ELMo, BERT
• Compute distinct embeddings for a word in its context

*sometimes the algorithm is loosely referred to as word2vec


Word2vec

• A framework for learning word vectors


• 2 models (a usage sketch follows this slide):
• skip-gram (Mikolov et al., 2013)
• predicts context words from the target word
• works well with small amounts of training data
• represents even rare words or phrases well
• Continuous bag of words (CBOW) (Mikolov et al., 2013)
• uses the context words to predict the target word
• several times faster to train than the skip-gram
• slightly better accuracy for frequent words

*sometimes the algorithm is loosely referred to as word2vec
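As a usage sketch for the two variants above (not part of the original slides): the gensim library implements both, switched by the sg flag. The snippet assumes the gensim 4.x API and uses a made-up toy corpus.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 -> skip-gram (predict context from target); sg=0 -> CBOW (predict target from context)
skipgram = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1)

print(skipgram.wv["cat"].shape)              # a 50-dimensional dense vector
print(skipgram.wv.similarity("cat", "dog"))  # cosine similarity of the learned vectors
```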


Word2vec

• Skip-gram models
• Naïve softmax (simple but expensive loss function when there are many output classes)
• More optimized variants like hierarchical softmax
• Negative sampling with logistic regression
• Skip-gram as logistic regression
• Instead of counting how often each word c occurs near a target word w => train a
binary classifier on a prediction task: “Is word c likely to show up near this target
word w?”
• Use logistic regression to train the classifier to distinguish the two cases +/-.
• Self-supervision:
• only uses running text as implicitly supervised training data
• a word c that occurs near the target word w acts as a ‘positive example’
• randomly sample other words in the lexicon to get negative samples (negative sampling).
• The learned weights are used as the embeddings.

11
Skip-gram model as logistic regression

• The classifier
• context: can be a window of ±n words around the target
• near: a word c is likely to occur near the target w if its embedding vector is similar to
the target embedding:
• Similarity(w,c) ≈ c·w
• Model the probability that word c is/is not a real context word for target word w (see the numeric sketch below):
P+(w,c) = σ(c·w) = 1 / (1 + exp(−c·w));   P−(w,c) = 1 − P+(w,c) = 1 / (1 + exp(c·w))
• Model the probability that a sequence c1..cL is a real context sequence for target
word w:
P+(w, c_1:L) = ∏_{i=1..L} σ(c_i·w)
• Each word has two embeddings: one for the word as a target, and one for the word
considered as context
• For all |V| words in the vocabulary
=> need to build two matrices W and C.
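A numeric sketch of the classifier above (not from the slides): with toy random embeddings standing in for rows of W and C, the probability that c is a real context word for w is the sigmoid of their dot product, and a whole context window is scored by the product of sigmoids.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8                               # toy embedding dimension
w = rng.normal(size=d)              # target-word embedding (a row of W)
c = rng.normal(size=d)              # candidate context-word embedding (a row of C)

p_pos = sigmoid(np.dot(c, w))       # P+(w, c) = sigma(c . w)
p_neg = 1.0 - p_pos                 # P-(w, c) = 1 - P+(w, c)
print(p_pos, p_neg)

# Probability that a whole window c_1..c_L is a real context: product of sigmoids
window = rng.normal(size=(4, d))    # L = 4 context-word embeddings
print(np.prod(sigmoid(window @ w)))
```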

Skip-gram model as logistic regression

• Learning embeddings - Negative sampling


• Input: a corpus of text
• Assign a random embedding vector to each of the |V| words
• Create positive and negative examples from the corpus
(e.g., a context window of L = ±2 words and k = 2 negative samples per positive example)

• Minimize a loss function that aims to:

• Maximize the similarity of (w, c_pos) word pairs
• Minimize the similarity of (w, c_neg) word pairs.
• L_CE = −log[ P+(w, c_pos) ∏_{i=1..k} P−(w, c_neg_i) ] = −[ log σ(c_pos·w) + Σ_{i=1..k} log σ(−c_neg_i·w) ]
• Walk through the training corpus, using stochastic gradient descent to iteratively
update the embedding of each word => the W and C matrices
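A sketch of the loss and one gradient step above, written directly in NumPy with toy random embeddings for a single (w, c_pos, c_neg_1..k) training example; this illustrates the formula, not the full training loop.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, k, lr = 8, 2, 0.1                      # toy dimension, k negative samples, learning rate
w = rng.normal(size=d)                    # target embedding (row of W)
c_pos = rng.normal(size=d)                # positive context embedding (row of C)
c_negs = rng.normal(size=(k, d))          # k sampled negative context embeddings

# L_CE = -[ log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w) ]
loss = -(np.log(sigmoid(c_pos @ w)) + np.sum(np.log(sigmoid(-(c_negs @ w)))))
print(loss)

# One stochastic gradient descent step on w and c_pos
grad_w = (sigmoid(c_pos @ w) - 1) * c_pos + (sigmoid(c_negs @ w)[:, None] * c_negs).sum(axis=0)
grad_c_pos = (sigmoid(c_pos @ w) - 1) * w
w -= lr * grad_w
c_pos -= lr * grad_c_pos
```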

Visualizing Embeddings - Semantic properties of embeddings

• The most common visualization method: t-SNE

• project the d dimensions of a word vector down into 2 dimensions
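A minimal sketch of that projection step (random placeholder vectors instead of real embeddings; scikit-learn's TSNE is assumed to be available):

```python
import numpy as np
from sklearn.manifold import TSNE

words = ["king", "queen", "man", "woman", "paris", "france"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 100))   # placeholder 100-d embeddings

# Project from 100 dimensions down to 2 for plotting (perplexity must be < number of points)
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```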

• The GloVe vector for king − man + woman is close to the vector for queen
• Embeddings capture comparative and superlative morphology
• => Preservation of semantic and syntactic relationships!


Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition
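The king − man + woman ≈ queen relation above can be checked with pre-trained vectors; below is a sketch using gensim's downloader (the 4.x API and the "glove-wiki-gigaword-100" model name are assumptions; the vectors are downloaded on first use).

```python
import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (downloaded on first use)
glove = api.load("glove-wiki-gigaword-100")

# king - man + woman ~= queen: most_similar performs the vector arithmetic
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Comparative/superlative morphology in the same spirit: big -> bigger, small -> ?
print(glove.most_similar(positive=["bigger", "small"], negative=["big"], topn=1))
```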
8.4 Evaluating Vector Models

• Extrinsic evaluation on tasks


• using vectors in an NLP task and seeing whether this improves performance over
some other model
• Intrinsic evaluations
• computing the correlation between an algorithm’s word similarity scores and word
similarity ratings assigned by humans (a small sketch follows this list)
• WordSim-353, SimLex-999: datasets present words without context
• Stanford Contextual Word Similarity (SCWS), Word-in-Context (WiC) dataset: include
context
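A minimal sketch of such an intrinsic evaluation: rank-correlate model similarities with human ratings using scipy's spearmanr. The word pairs and ratings below are made up to stand in for a dataset like SimLex-999, and the model is assumed to expose a gensim-style similarity(w1, w2) method (e.g., the GloVe vectors loaded in the earlier sketch).

```python
from scipy.stats import spearmanr

# Made-up (word1, word2, human_rating) triples standing in for a real dataset
pairs = [
    ("old", "new", 1.6),
    ("smart", "intelligent", 9.2),
    ("hard", "difficult", 8.8),
    ("happy", "cheerful", 9.5),
]

def intrinsic_eval(model, pairs):
    """Spearman correlation between model similarity scores and human ratings."""
    model_scores = [model.similarity(w1, w2) for w1, w2, _ in pairs]
    human_scores = [rating for _, _, rating in pairs]
    correlation, _ = spearmanr(model_scores, human_scores)
    return correlation

# Usage with a gensim KeyedVectors-style model, e.g. the GloVe vectors loaded earlier:
# print(intrinsic_eval(glove, pairs))
```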

Resources

• Word2vec (Mikolov et al)


• https://code.google.com/archive/p/word2vec/
• GloVe (Pennington, Socher, Manning)
• http://nlp.stanford.edu/projects/glove/
