
Word2Vec

Word2Vec - Introduction
➢ Word2Vec is a technique used to learn vector representations
(embeddings) of words, capturing semantic and syntactic relationships
between them.

➢ The key intuition behind Word2Vec is to represent words in a continuous vector space where words that share similar contexts in the corpus are located close to each other.

➢ Words that appear in similar contexts (surrounded by the same words) tend to have similar meanings. Word2Vec leverages this by learning from large text corpora, optimizing word vectors such that words with similar contexts end up with similar vectors.
Word2Vec - Introduction
➢ Some Examples:
➢ Synonyms and Similar Words
➢ happy and joyful often appear in similar contexts, like "She felt very ___ today."
➢ The vectors for happy and joyful will be close to each other in the embedding
space because they often occur in similar contexts, indicating they are semantically
similar.

➢ Analogies
➢ King is to Man as Queen is to Woman.
➢ The relationship between these words can be captured by vector arithmetic. For
example, if king − man + woman is computed using the corresponding word vectors,
the result is a vector close to queen.
➢ v_king − v_man + v_woman ≈ v_queen
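A minimal sketch of checking this analogy with pretrained vectors, assuming the gensim library and its downloadable "word2vec-google-news-300" model:

```python
# Minimal sketch: king - man + woman ~= queen, assuming gensim and a downloadable pretrained model.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # KeyedVectors with 300-dim embeddings

# most_similar adds the "positive" vectors and subtracts the "negative" ones,
# then returns the nearest words in the embedding space.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```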

Word2Vec - Introduction
➢ Some Examples:
➢ Associations in Context
➢ Paris is associated with France, and Tokyo is associated with Japan.
➢ The vector relationship can be captured similarly to the analogy example:
➢ v_Paris − v_France ≈ v_Tokyo − v_Japan
➢ This shows how countries relate to their capitals in the vector space.

➢ Syntactic Relationships
➢ walking is related to walk as running is related to run.
➢ The model can capture these relationships where the vectors for walking and walk
have a similar directional relationship as running and run.
➢ This indicates that Word2Vec also captures the morphological similarities
between words.

Word2Vec - Introduction
➢ Some Examples:
➢ Categorical Groupings
➢ Apple, Banana, and Orange are all fruits, while Dog, Cat, and Horse are animals.
➢ Words that belong to the same category (like fruits or animals) tend to cluster
together in the embedding space, reflecting their semantic grouping.

➢ Causal or Temporal Relationships


➢ Rain often appears in contexts with umbrella or wet, while fire might appear with
smoke or hot.
➢ The vectors for words like rain and umbrella may be closer together because of
their frequent co-occurrence in causal or temporal contexts.

Word2Vec – Improvements over 1-Hot Encoding
➢ Sparsity vs. Density
➢ One-Hot Encoding: Each word is represented as a very high-dimensional
vector (the size of the vocabulary), with all elements being zero except one.
This representation is sparse and doesn't capture any meaningful relationships
between words.

➢ Word2Vec: Words are represented by dense, low-dimensional vectors (usually 50 to 300 dimensions), where each dimension carries some information about the word. Unlike sparse, high-dimensional one-hot vectors, these dense embeddings make it possible to capture meaningful relationships and more nuanced similarities between words.
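A minimal sketch of the size difference, assuming NumPy; the vocabulary size, embedding size, word index, and the random "embedding matrix" below are all stand-in assumptions for learned weights:

```python
# Minimal sketch contrasting one-hot and dense representations, assuming NumPy.
import numpy as np

vocab_size, embed_dim = 100_000, 300
word_index = 4_217  # hypothetical index of some word in the vocabulary

# One-hot: sparse, |V|-dimensional, all zeros except a single 1.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Word2Vec-style embedding: one dense row of a |V| x |E| matrix (random here, learned in practice).
embedding_matrix = np.random.randn(vocab_size, embed_dim)
dense_vector = embedding_matrix[word_index]

print(one_hot.shape, dense_vector.shape)  # (100000,) (300,)
```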
Word2Vec – Improvements over 1-Hot Encoding
➢ Lack of Semantic Information
➢ One-Hot Encoding: One-hot vectors are orthogonal to each other, meaning
that no two words are similar based on their representation. This limits the
ability to capture any semantic or syntactic relationships between words.

➢ Word2Vec: Similar words have similar vector representations. For example, "cat" and "dog" might have vectors that are close together, indicating their semantic similarity.
Word2Vec – Improvements over 1-Hot Encoding
➢ Dimensionality
➢ One-Hot Encoding: The dimensionality of a one-hot vector is equal to the size
of the vocabulary, which can be very large (e.g., hundreds of thousands of
dimensions).

➢ Word2Vec: The dimensionality of word embeddings is much lower (e.g., 50-300 dimensions), making them more computationally efficient to use in machine learning models.
Word2Vec – Improvements over 1-Hot Encoding
➢ Flexibility in Downstream Tasks
➢ One-Hot Encoding: Doesn't allow for capturing relationships between words,
making it less effective in tasks like semantic similarity, analogy reasoning, and
sentiment analysis.

➢ Word2Vec: Embeddings can be used in various downstream tasks such as text classification, sentiment analysis, and machine translation, where the relationships between words are important.
Word2Vec – Two Models
➢ Skip-gram: Given a word, predict its surrounding context words.

➢ CBOW (Continuous Bag of Words): Given the surrounding context words, predict the target word.
Word2Vec – Architecture
➢ Both CBOW and Skip-gram are 3-layer neural networks.
➢ Input Layer (Size: |V|)
➢ Hidden Layer (Size: |E|)
➢ Output Layer (Size: |V|)

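A minimal sketch of this shared |V| → |E| → |V| shape, assuming PyTorch; the layer sizes and method names are illustrative assumptions rather than the reference implementation:

```python
# Minimal sketch of the shared |V| -> |E| -> |V| architecture, assuming PyTorch.
import torch
import torch.nn as nn

class Word2VecModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 100):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # input -> hidden (|V| -> |E|)
        self.output = nn.Linear(embed_dim, vocab_size)          # hidden -> output (|E| -> |V|)

    def forward_cbow(self, context_ids: torch.Tensor) -> torch.Tensor:
        # CBOW: average the context-word embeddings, then predict the target word.
        hidden = self.embeddings(context_ids).mean(dim=1)        # (batch, |E|)
        return self.output(hidden)                               # logits over |V|

    def forward_skipgram(self, center_ids: torch.Tensor) -> torch.Tensor:
        # Skip-gram: embed the center word, then predict its context words.
        hidden = self.embeddings(center_ids)                     # (batch, |E|)
        return self.output(hidden)                               # logits over |V|
```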
Word2Vec – CBOW Model in Detail

[Figure: CBOW architecture diagram. Image source: https://github.com/OlgaChernytska/word2vec-pytorch/tree/main]


Word2Vec – CBOW Model in Detail

The output layer applies a softmax over the vocabulary:

P(w_i | context) = exp(z_i) / Σ_j exp(z_j)

where z_i is the logit (the raw score from the neural network) for word i.

[Figure: CBOW forward pass with softmax output. Image source: https://github.com/OlgaChernytska/word2vec-pytorch/tree/main]
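A minimal numerical sketch of that softmax, assuming NumPy; the logits are made-up values for a tiny vocabulary:

```python
# Minimal softmax sketch over a tiny "vocabulary" of 4 words, assuming NumPy.
import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.1])   # z_i: raw scores from the network (made up)
probs = np.exp(logits - logits.max())       # subtract the max for numerical stability
probs /= probs.sum()

print(probs, probs.sum())                   # probabilities over the vocabulary, summing to 1
```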
Word2Vec – Skip-gram Model in Detail

[Figure: Skip-gram architecture diagram. Image source: https://github.com/OlgaChernytska/word2vec-pytorch/tree/main]


Word2Vec – Inference-time Behavior
➢ Once the Word2Vec (CBOW/Skip-gram) model is trained, the output layer is dropped; the embedding of a word is obtained by projecting that word onto the hidden/embedding layer, i.e., by reading off the corresponding row of the input-to-hidden weight matrix (see the sketch below).

➢ The original Word2Vec was trained on the Google News dataset.

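A minimal sketch of this lookup, assuming PyTorch; the vocabulary size, embedding size, and word index are made up, and the untrained nn.Embedding stands in for the trained input-to-hidden weights:

```python
# Minimal sketch: at inference time only the input->hidden weights are used.
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 100
embeddings = nn.Embedding(vocab_size, embed_dim)   # stands in for the trained |V| x |E| matrix

# The |E| -> |V| output layer is simply not used; a word's embedding is the row
# of the embedding matrix at that word's vocabulary index.
word_id = torch.tensor([42])                        # hypothetical vocabulary index
word_vector = embeddings(word_id).detach()          # shape: (1, 100)
print(word_vector.shape)
```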
Word2Vec – Computational Efficiency
➢ In Word2Vec, predicting the probability distribution over the entire
vocabulary can be computationally expensive, especially when the vocabulary
size 𝑉 is large.

➢ The computational complexity of the softmax function for a single prediction is O(V), because it requires calculating the exponentials for all V words in the vocabulary and then normalizing them. Even during training, as the input changes from one example to the next, the logits (and therefore the predictions) change, so these exponential terms must be recomputed.
Word2Vec – Hierarchical Softmax
➢ Hierarchical Softmax is a technique that approximates the softmax
function by representing the probability distribution over the vocabulary as
a binary tree (often a Huffman tree) instead of a flat structure.

➢ In this tree:


➢ Internal Nodes: Represent binary decisions (left or right) during traversal.
➢ Leaf Nodes: Correspond to words in the vocabulary.

➢ In traditional softmax, the output layer is flat and has |V| neurons with
softmax activation function. Hierarchical Softmax replaces the output layer
with a binary tree with |V|-1 internal nodes (neurons with sigmoid activation
function) and |V| leaf nodes. Each internal node is fully connected to the hidden (embedding) layer, and each leaf node corresponds to a vocabulary word.
Word2Vec – Hierarchical Softmax
➢ The key idea is that instead of directly computing the probability of a
word, the model computes the probability of a specific path from the root
of the tree to the leaf node corresponding to that word. Each step in the
path represents a binary decision.

Word2Vec – Hierarchical Softmax
➢ Example: Let's consider a small vocabulary: {cat, dog, fish, bird}. We want to calculate the probability of the word "fish."

➢ The vocabulary is organized into a binary tree. Here's one possible tree:

[Figure: a binary tree over {cat, dog, fish, bird}; going right at the root and then left at internal node B leads to the leaf "fish".]

➢ The leaf nodes represent the words. Internal nodes represent binary decisions (left or right).
Word2Vec – Hierarchical Softmax
➢ To calculate the probability of "fish," the model predicts the sequence of decisions to reach the leaf node "fish."

➢ The path is right (1) -> left (0). We need to compute P(right at root) and P(left at node B).

➢ P(fish) = P(right at root) × P(left at node B)

➢ Each decision probability is computed as:

P(right at root) = σ(v_hidden ⋅ w_hidden-root + b_root)

(Here "right" is encoded as 1, so its probability is taken directly from the sigmoid output; the convention could equally be reversed.)

P(left at node B) = 1 − σ(v_hidden ⋅ w_hidden-B + b_B)

where σ(x) is the sigmoid function: σ(x) = 1 / (1 + exp(−x))
➢ Then, the same categorical cross-entropy loss is used.
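A minimal numerical sketch of this path probability, assuming NumPy; the hidden vector, node weights, and biases are made-up values:

```python
# Minimal sketch of P(fish) under hierarchical softmax, assuming NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_hidden = np.array([0.2, -0.1, 0.4])                # hidden/embedding activation (made up)
w_root, b_root = np.array([0.5, 0.3, -0.2]), 0.1     # root node parameters (made up)
w_B, b_B = np.array([-0.4, 0.2, 0.6]), -0.3          # internal node B parameters (made up)

p_right_at_root = sigmoid(v_hidden @ w_root + b_root)
p_left_at_B = 1.0 - sigmoid(v_hidden @ w_B + b_B)

p_fish = p_right_at_root * p_left_at_B               # probability of the path right -> left
print(p_fish)
```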
Negative Sampling
➢ Negative sampling is another technique for addressing the computational complexity of Word2Vec.

➢ The architecture remains the same (|V|-|E|-|V|), but the output layer has
sigmoid neurons instead of softmax neurons.

➢ Here, for every positive example, we randomly sample k negative examples from the vocabulary.

➢ k is typically between 5 and 20 for small datasets and between 2 and 5 for large datasets.
Negative Sampling
➢ Input-Groundtruth Preparation
➢ While using negative sampling, each input-groundtruth pair consists of a
positive pair and k negative pairs.
➢ Sentence: “The quick brown fox jumps over the lazy dog.”
➢ Assume context window size = 2

➢ Skip-gram
➢ Input: "fox"
➢ Groundtruth (Positive Context Words): “quick”, “brown”, “jumps”, “over”
➢ Negative Samples for the first example: “dog”, “the” (assuming k=2)
➢ So, the first input-groundtruth pair is (fox-quick, fox-dog, fox-the)
➢ The second input-groundtruth pair is (fox-brown, fox-NegSample1, fox-NegSample2)
etc.

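A minimal sketch of building such skip-gram pairs with negative samples, using only the Python standard library; the whitespace tokenizer and uniform negative sampling are simplifying assumptions:

```python
# Minimal sketch: skip-gram pairs with k negative samples per positive pair.
import random

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))
window, k = 2, 2

random.seed(0)
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j == i:
            continue
        positive = (center, sentence[j], 1)                      # label 1: real context word
        candidates = [w for w in vocab if w not in (center, sentence[j])]
        negatives = [(center, w, 0) for w in random.sample(candidates, k)]  # label 0: noise words
        pairs.append([positive] + negatives)

print(pairs[0])   # one positive (center, context) pair plus its k negative pairs
```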
Negative Sampling
➢ CBOW
➢ Input (Context Words): “quick”, “brown”, “jumps”, “over”
➢ Groundtruth (Target Word): “fox”
➢ Negative Samples for the first example: “dog”, “the” (assuming k=2)
➢ The first input-groundtruth pair is: (quick, brown, jumps, over -> fox; quick, brown, jumps, over -> dog; quick, brown, jumps, over -> the)
Negative Sampling
➢ The loss for a single input-groundtruth pair is defined as

L = − log σ(v_w ⋅ v_c) − Σ_{i=1..k} log σ(− v_{w_i} ⋅ v_c)

Remember: σ(−x) = 1 − σ(x)
where:
• σ is the sigmoid function.
• v_w is the vector for the correct word; to be precise, the weight vector between the output neuron of the correct/ground-truth word and the hidden (embedding) layer.
• v_c is the context's (input word's) embedding vector.
• v_{w_i} are the weight vectors for the negative samples.
• k is the number of negative samples.
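A minimal numerical sketch of this loss, assuming NumPy; the vectors are made-up values:

```python
# Minimal sketch of the negative-sampling loss for one positive pair and k = 2 negatives, assuming NumPy.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_c = np.array([0.1, -0.2, 0.3])                  # input (context) word embedding (made up)
v_w = np.array([0.2, -0.1, 0.4])                  # output vector of the ground-truth word (made up)
v_negs = np.array([[0.5, 0.1, -0.3],              # output vectors of the negative samples (made up)
                   [-0.4, 0.2, 0.1]])

loss = -np.log(sigmoid(v_w @ v_c)) - np.sum(np.log(sigmoid(-(v_negs @ v_c))))
print(loss)
```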
Negative Sampling
➢ In effect, we want the neural network to predict 1 for the positive input-groundtruth pair and 0 for each negative input-groundtruth pair.

➢ During backpropagation:
➢ Output-Hidden: only the weights between the hidden layer and the output neurons involved in the current pairs (the ground-truth word and the k negative samples) are updated.
➢ Hidden-Input: only the weights between the hidden layer and the input word(s) are updated.
Negative Sampling
➢ Sampling Methods:
➢ Uniform Sampling

➢ Frequency-based Sampling

➢ Smoothed Frequency-Based Sampling

➢ Frequency-based sampling in which word frequencies are raised to a smoothing exponent (usually 0.75), reducing the probability of very common words being selected.
Disclaimer
➢ These slides are not original and have been prepared from various
sources for teaching purposes.
