Word2Vec
Word2Vec - Introduction
➢ Word2Vec is a technique used to learn vector representations
(embeddings) of words, capturing semantic and syntactic relationships
between them.
Word2Vec - Introduction
➢ Some Examples:
➢ Synonyms and Similar Words
➢ happy and joyful often appear in similar contexts, like "She felt very ___ today."
➢ The vectors for happy and joyful will be close to each other in the embedding
space because they often occur in similar contexts, indicating they are semantically
similar.
➢ Analogies
➢ King is to Man as Queen is to Woman.
➢ The relationship between these words can be captured by vector arithmetic. For
example, if king − man + woman is computed using the corresponding word vectors,
the result is a vector close to queen.
➢ v_king − v_man + v_woman ≈ v_queen
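➢ A minimal sketch (not from the slides) of this arithmetic, using hand-made 3-dimensional toy vectors; real Word2Vec embeddings are learned from a corpus and have hundreds of dimensions.

import numpy as np

# Toy, hand-made 3-d "embeddings" (illustrative only; real vectors are learned).
vectors = {
    "king":  np.array([0.8, 0.65, 0.10]),
    "man":   np.array([0.7, 0.10, 0.10]),
    "woman": np.array([0.7, 0.10, 0.80]),
    "queen": np.array([0.8, 0.60, 0.75]),
    "apple": np.array([0.0, 0.90, 0.30]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# v_king - v_man + v_woman should land closest to v_queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # "queen" with these toy numbers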
Word2Vec - Introduction
➢ Some Examples:
➢ Associations in Context
➢ Paris is associated with France, and Tokyo is associated with Japan.
➢ The vector relationship can be captured similarly to the analogy example:
➢ v_Paris − v_France ≈ v_Tokyo − v_Japan
➢ This shows how countries relate to their capitals in the vector space.
➢ Syntactic Relationships
➢ walking is related to walk as running is related to run.
➢ The model can capture these relationships: the vectors for walking and walk have a
directional relationship similar to that between running and run.
➢ This indicates that Word2Vec also captures the morphological similarities
between words.
Word2Vec - Introduction
➢ Some Examples:
➢ Categorical Groupings
➢ Apple, Banana, and Orange are all fruits, while Dog, Cat, and Horse are animals.
➢ Words that belong to the same category (like fruits or animals) tend to cluster
together in the embedding space, reflecting their semantic grouping.
Word2Vec – Improvements over 1-Hot Encoding
➢ Sparsity vs. Density
➢ One-Hot Encoding: Each word is represented as a very high-dimensional
vector (the size of the vocabulary), with all elements being zero except one.
This representation is sparse and doesn't capture any meaningful relationships
between words.
➢ Word2Vec: Each word is represented as a dense, low-dimensional vector in which
every element carries information, so similarity between words can be measured directly.
Word2Vec – Improvements over 1-Hot Encoding
➢ Lack of Semantic Information
➢ One-Hot Encoding: One-hot vectors are orthogonal to each other, meaning
that no two words are similar based on their representation. This limits the
ability to capture any semantic or syntactic relationships between words.
➢ Word2Vec: Words that appear in similar contexts receive nearby embeddings, so
semantic and syntactic relationships are reflected in the geometry of the vector space.
Word2Vec – Improvements over 1-Hot Encoding
➢ Dimensionality
➢ One-Hot Encoding: The dimensionality of a one-hot vector is equal to the size
of the vocabulary, which can be very large (e.g., hundreds of thousands of
dimensions).
➢ Word2Vec: The embedding dimensionality is a fixed, much smaller hyperparameter
(typically a few hundred dimensions), independent of the vocabulary size.
Word2Vec – Improvements over 1-Hot Encoding
➢ Flexibility in Downstream Tasks
➢ One-Hot Encoding: Doesn't allow for capturing relationships between words,
making it less effective in tasks like semantic similarity, analogy reasoning, and
sentiment analysis.
➢ Word2Vec: The learned embeddings can be reused as input features for such
downstream tasks, where the relationships they capture improve performance.
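➢ A minimal sketch (vocabulary and embedding sizes assumed, not from the slides) contrasting the two representations: one-hot vectors are |V|-dimensional and mutually orthogonal, while dense embeddings are |E|-dimensional rows of a learned matrix that can be compared meaningfully.

import numpy as np

V = 50_000   # assumed vocabulary size
E = 300      # assumed embedding size

# One-hot: |V|-dimensional, a single 1, everything else 0.
happy_onehot = np.zeros(V)
happy_onehot[101] = 1.0      # hypothetical index of "happy"
joyful_onehot = np.zeros(V)
joyful_onehot[2302] = 1.0    # hypothetical index of "joyful"
print(happy_onehot @ joyful_onehot)   # 0.0: orthogonal, carries no similarity signal

# Dense embeddings: rows of a |V| x |E| matrix (random here, for shape only).
embedding_matrix = np.random.randn(V, E)
happy_vec, joyful_vec = embedding_matrix[101], embedding_matrix[2302]
cos = happy_vec @ joyful_vec / (np.linalg.norm(happy_vec) * np.linalg.norm(joyful_vec))
print(cos)   # after training, semantically similar words would score high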
Word2Vec – Two Models
➢ CBOW (Continuous Bag of Words): Given the surrounding context words, predict the target word.
➢ Skip-gram: Given a word, predict its surrounding context words.
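➢ A short sketch (window size 2 assumed, as on the later example slide) of how the two models frame the task, generating Skip-gram (center -> context) and CBOW (context -> center) training pairs from one sentence.

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

skipgram_pairs, cbow_pairs = [], []
for i, center in enumerate(sentence):
    # Context = words within the window on either side of the center word.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    skipgram_pairs += [(center, c) for c in context]   # word -> each context word
    cbow_pairs.append((context, center))               # context words -> word

print(skipgram_pairs[:4])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
print(cbow_pairs[3])        # (['quick', 'brown', 'jumps', 'over'], 'fox')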
Word2Vec – Architecture
➢ Both CBOW and Skip-gram are 3-layer neural networks:
➢ Input Layer (Size: |V|)
➢ Hidden Layer (Size: |E|)
➢ Output Layer (Size: |V|)
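➢ A minimal numpy sketch (sizes assumed) of the |V|-|E|-|V| architecture for the Skip-gram case: the one-hot input selects a row of the input-to-hidden matrix (the word's embedding), and the hidden-to-output matrix followed by softmax gives a distribution over the vocabulary.

import numpy as np

V, E = 10_000, 100                      # assumed vocabulary and embedding sizes
W_in = np.random.randn(V, E) * 0.01     # input -> hidden weights (the embedding matrix)
W_out = np.random.randn(E, V) * 0.01    # hidden -> output weights

def forward(word_idx):
    h = W_in[word_idx]                  # hidden layer = embedding of the input word
    scores = h @ W_out                  # one score per vocabulary word
    scores = scores - scores.max()      # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over all |V| words
    return h, probs

h, probs = forward(42)                  # 42 is an arbitrary word index
print(h.shape, probs.shape, probs.sum())   # (100,) (10000,) ~1.0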
Word2Vec – CBOW Model in Detail
Word2Vec – Computational Efficiency
➢ In Word2Vec, predicting the probability distribution over the entire
vocabulary can be computationally expensive, especially when the vocabulary
size |V| is large.
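➢ For illustration (numbers assumed, not from the slides): with |V| = 100,000 and |E| = 300, the hidden-to-output step alone needs about |E| × |V| = 300 × 100,000 = 3 × 10^7 multiplications per training example, and the softmax must then be normalized over all 100,000 output scores.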
Word2Vec – Hierarchical Softmax
➢ Hierarchical Softmax is a technique that approximates the softmax
function by representing the probability distribution over the vocabulary as
a binary tree (often a Huffman tree) instead of a flat structure.
➢ In traditional softmax, the output layer is flat and has |V| neurons with a
softmax activation function. Hierarchical Softmax replaces this output layer
with a binary tree that has |V|−1 internal nodes (neurons with a sigmoid
activation function) and |V| leaf nodes. Each internal node is fully connected
to the hidden (embedding) layer, and each leaf node represents a vocabulary word.
Word2Vec – Hierarchical Softmax
➢The key idea is that instead of directly computing the probability of a
word, the model computes the probability of a specific path from the root
of the tree to the leaf node corresponding to that word. Each step in the
path represents a binary decision.
Word2Vec – Hierarchical Softmax
➢ Example: Let's consider a small vocabulary: {cat, dog, fish, bird}. We want
to calculate the probability of the word "fish."
➢ The vocabulary is organized into a binary tree. Here's one possible tree:
➢ The leaf nodes represent the words. Internal nodes represent binary
decisions (left or right).
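➢ (One possible tree, consistent with the path used on the next slide; the placement of cat, dog, and bird is assumed here: the root branches left to internal node A and right to internal node B, A branches to the leaves cat and dog, and B branches left to fish and right to bird.)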
Word2Vec – Hierarchical Softmax
➢ To calculate the probability of "fish," the model predicts the sequence of
decisions needed to reach the leaf node "fish."
➢ The path is right (1) -> left (0). We need to compute P(right at root) and
P(left at node B).
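➢ Spelled out (with the convention, assumed here, that σ(v_n · h) is the probability of going right at internal node n, σ(−v_n · h) = 1 − σ(v_n · h) the probability of going left, and h the hidden-layer vector): P(fish) = P(right at root) × P(left at B) = σ(v_root · h) × σ(−v_B · h).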
➢ The architecture remains the same (|V|-|E|-|V|), but the output layer has
sigmoid neurons instead of softmax neurons.
Negative Sampling
➢ Input-Groundtruth Preparation
➢ While using negative sampling, each input-groundtruth pair (training example) consists of
one positive pair and k negative pairs.
➢ Sentence: “The quick brown fox jumps over the lazy dog.”
➢ Assume context window size = 2
➢ Skip-gram
➢ Input: “fox”
➢ Groundtruth (Positive Context Words): “quick”, “brown”, “jumps”, “over”
➢ Negative Samples for the first example: “dog”, “the” (assuming k = 2)
➢ So, the first input-groundtruth pair is (fox-quick as the positive pair; fox-dog and fox-the
as the negative pairs).
➢ The second input-groundtruth pair is (fox-brown, fox-NegSample1, fox-NegSample2), and so on,
as sketched below.
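➢ A minimal sketch (word choices and k = 2 as on this slide) of assembling one Skip-gram training example under negative sampling: one positive (input, context) pair plus k negative pairs whose words are drawn from outside the current context (uniform sampling here, for simplicity).

import random

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
k = 2

def make_example(center, positive_context, k):
    # Negative words: sampled from the vocabulary, excluding the center word
    # and its true context words (a simplification of the real sampling scheme).
    candidates = [w for w in vocab if w != center and w not in positive_context]
    negatives = random.sample(candidates, k)
    return {"positive": (center, positive_context[0]),
            "negatives": [(center, n) for n in negatives]}

example = make_example("fox", ["quick", "brown", "jumps", "over"], k)
print(example)   # e.g. positive ('fox', 'quick'); negatives such as ('fox', 'the'), ('fox', 'dog')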
Negative Sampling
➢ CBOW
➢ Input (Context Words): “quick”, “brown”, “jumps”, “over”
➢ Groundtruth (Target Word): “fox”
➢ Negative Samples for the first example: “dog”, “the” (assuming k=2)
➢ The first input-groundtruth pair is: (quick, brown, jumps, over -> fox; quick, brown,
jumps, over -> dog; quick, brown, jumps, over -> the)
Negative Sampling
➢ The loss for a single input-groundtruth pair is defined as
L = −log σ(v_w · v_c) − Σ_{i=1..k} log σ(−v_{w_i} · v_c)
Remember: σ(−x) = 1 − σ(x)
where:
• σ is the sigmoid function.
• v_w is the vector for the correct word. To be precise, it is the weight
vector between the output neuron of the correct/ground-truth word and the
hidden (embedding) layer.
• v_c is the context's (input word's) embedding vector.
• v_{w_i} are the weight vectors for the negative samples.
• k is the number of negative samples.
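➢ A numpy sketch of this loss for one input-groundtruth pair (toy random vectors; v_c is the input word's embedding, v_w the output-side vector of the ground-truth word, and v_neg the output-side vectors of the k negative samples).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, v_w, v_neg):
    # L = -log σ(v_w · v_c) - Σ_i log σ(−v_{w_i} · v_c)
    positive_term = -np.log(sigmoid(v_w @ v_c))
    negative_term = -np.sum(np.log(sigmoid(-(v_neg @ v_c))))
    return positive_term + negative_term

E, k = 100, 2
rng = np.random.default_rng(0)
v_c = rng.normal(size=E)          # embedding of the input word
v_w = rng.normal(size=E)          # output vector of the ground-truth word
v_neg = rng.normal(size=(k, E))   # output vectors of the k negative samples
print(neg_sampling_loss(v_c, v_w, v_neg))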
Negative Sampling
➢ Intuitively, we want the neural network to predict 1 for the positive
input-groundtruth pair and 0 for each negative input-groundtruth pair.
➢ During backpropagation:
➢ Output-Hidden: only the weights between the hidden layer and the output neurons involved
in the example (the ground-truth word and the k negative samples) are updated.
➢ Hidden-Input: only the weights between the input word(s) and the hidden layer are updated.
Negative Sampling
➢ Sampling Methods:
➢ Uniform Sampling
➢ Frequency-based Sampling
➢ Frequency-based sampling with a smoothing exponent (usually 0.75) applied to the word
frequencies, which reduces the probability of very common words being selected.
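➢ A short sketch (corpus counts made up) comparing the three sampling distributions: raising the counts to the 0.75 power lowers the relative probability of very frequent words and raises that of rare ones.

import numpy as np

counts = {"the": 50_000, "fox": 300, "embedding": 20}   # hypothetical corpus counts
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

uniform  = np.full(len(words), 1 / len(words))
plain    = freq / freq.sum()
smoothed = freq ** 0.75 / (freq ** 0.75).sum()

for w, u, p, s in zip(words, uniform, plain, smoothed):
    print(f"{w:>9}: uniform={u:.3f}  frequency={p:.3f}  smoothed={s:.3f}")
# "the" dominates the plain frequency distribution, somewhat less so after smoothing.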
Disclaimer
➢ These slides are not original and have been prepared from various
sources for teaching purposes.