Word Embeddings Notes
The resulting word embeddings have many useful properties. One is that words
with similar meanings or connotations are often close to each other in the
vector space. This means that words like "happy" and "joyful" might be very
close in the vector space, while words like "happy" and "sad" would be far apart.
This can be useful for a wide range of NLP tasks, such as sentiment analysis,
language translation, and information retrieval.
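As a quick illustration of what "close in the vector space" means, the sketch below computes cosine similarity between hand-picked, purely hypothetical 3-dimensional vectors (they do not come from any trained model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 = very similar direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings, purely for illustration.
happy  = np.array([0.9, 0.7, 0.1])
joyful = np.array([0.8, 0.8, 0.2])
sad    = np.array([-0.7, 0.6, 0.1])

print(cosine_similarity(happy, joyful))  # high, close to 1
print(cosine_similarity(happy, sad))     # much lower (here even slightly negative)
```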
Consider the following similar sentences: "Have a good day" and "Have a great
day". They hardly differ in meaning. If we construct an exhaustive
vocabulary (let's call it V), it would be V = {Have, a, good, great, day}.
Now, let us create a one-hot encoded vector for each of these words in V. The length
of each one-hot encoded vector would be equal to the size of V (= 5). Each vector
would be all zeros except for the element at the index representing the
corresponding word in the vocabulary; that particular element would be one.
The encodings below illustrate this.
Have = [1, 0, 0, 0, 0]ᵀ; a = [0, 1, 0, 0, 0]ᵀ; good = [0, 0, 1, 0, 0]ᵀ; great = [0, 0, 0, 1, 0]ᵀ;
day = [0, 0, 0, 0, 1]ᵀ (ᵀ denotes transpose)
If we try to visualize these encodings, we can think of a 5-dimensional space,
where each word occupies one of the dimensions and has nothing to do with
the rest (no projection along the other dimensions). This means ‘good’ and
‘great’ are as different as ‘day’ and ‘Have’, which is not true.
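A minimal NumPy sketch of these one-hot encodings; the dot products at the end confirm that every pair of one-hot vectors is orthogonal, so ‘good’ and ‘great’ look exactly as unrelated as ‘day’ and ‘Have’:

```python
import numpy as np

vocab = ["Have", "a", "good", "great", "day"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's index in the vocabulary.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("good"))                     # [0. 0. 1. 0. 0.]
print(one_hot("good") @ one_hot("great"))  # 0.0 -> orthogonal
print(one_hot("day")  @ one_hot("Have"))   # 0.0 -> equally "different"
```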
Word2vec
Word2vec is a specific algorithm for generating word embeddings, developed
by Tomas Mikolov and his team at Google. Word2vec is a type of neural network
that takes in a large corpus of text data and generates word embeddings by
predicting the probability of words appearing together in a given context.
Two different learning models were introduced that can be used as part
of the word2vec approach to learn the word embeddings; they are the Continuous
Bag-of-Words (CBOW) model and the Skip-gram model.
Continuous Bag-of-Words (CBOW):
CBOW predicts a target word from its surrounding context words. The walkthrough
is broken down into the following steps:
1. Data Preparation: Defining corpus by tokenizing text.
2. Generate Training Data: Build vocabulary of words, one-hot encoding for
words, word index.
3. Train Model: Pass one-hot encoded words through the forward pass, calculate
the error rate by computing the loss, and adjust the weights using backpropagation.
Output: Using the trained model, calculate word vectors and find similar words.
I will explain the CBOW steps without code, but if you want full working code
for CBOW with NumPy from scratch, I have a separate post for that; you can
always jump into it.
1. Data Preparation:
For example, for the context word “i” the target word will be “like”. For our
example text, the full training data will look like:
2. Generate Training Data:
One-hot encoding: We need to convert the text to one-hot encodings, since the
algorithm can only work with numeric values.
For example, the encoded value of the word “i”, which appears first in the
vocabulary, will be the vector [1, 0, 0, 0, 0]. The word “like”, which appears
second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0].
As you can see, the table above is our final training data, where the encoded
target word is the Y variable for our model and the encoded context word is
the X variable.
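The sketch below shows one way this training data could be built with NumPy. The example corpus is an assumption on my part (it is not stated above); "i like natural language processing" is simply consistent with a five-word vocabulary in which “i” comes first and “like” comes second.

```python
import numpy as np

# Assumed example corpus, used purely for illustration.
corpus = "i like natural language processing"
tokens = corpus.split()

vocab = sorted(set(tokens), key=tokens.index)     # keep first-occurrence order
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Single-word CBOW with window size 1: each neighbouring word is a context (X)
# and the word itself is the target (Y).
X, Y = [], []
for pos, target in enumerate(tokens):
    for offset in (-1, 1):
        ctx = pos + offset
        if 0 <= ctx < len(tokens):
            X.append(one_hot(tokens[ctx]))   # encoded context word, e.g. "i"
            Y.append(one_hot(target))        # encoded target word, e.g. "like"

X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)   # (8, 5) (8, 5) for this assumed corpus
```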
3. Training Model:
So far, so good, right? Now we need to pass this data into a basic neural
network with one hidden layer and train it. The only thing to note is that the
desired vector dimension of each word will be the number of hidden nodes.
For this tutorial and demo purposes my desired vector dimension is 3. For
example:
“i” => [0.001, 0.896, 0.763], so the number of hidden-layer nodes will be 3.
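A minimal from-scratch training sketch, assuming the X, Y arrays and word_to_index mapping from the data-preparation sketch above; the hidden layer has 3 nodes, so each learned word vector is 3-dimensional. This is only an illustrative loop, not the full code from the separate post.

```python
import numpy as np

np.random.seed(0)
vocab_size, embedding_dim = 5, 3   # 3 hidden nodes -> 3-dimensional word vectors

W1 = np.random.randn(vocab_size, embedding_dim) * 0.1   # input -> hidden weights
W2 = np.random.randn(embedding_dim, vocab_size) * 0.1   # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.05
for epoch in range(1000):
    loss = 0.0
    for x, y in zip(X, Y):            # X, Y from the data-preparation sketch above
        h = W1.T @ x                  # hidden layer: the context word's vector
        u = W2.T @ h                  # output scores, one per vocabulary word
        y_pred = softmax(u)
        loss += -np.sum(y * np.log(y_pred + 1e-9))   # cross-entropy loss

        # Backpropagation
        e = y_pred - y
        dW2 = np.outer(h, e)
        dW1 = np.outer(x, W2 @ e)
        W1 -= learning_rate * dW1
        W2 -= learning_rate * dW2
    if epoch % 200 == 0:
        print(f"epoch {epoch}, loss {loss:.4f}")

# Each row of W1 is now a learned 3-dimensional word vector.
print(W1[word_to_index["i"]])   # actual values will differ from the illustrative [0.001, 0.896, 0.763]
```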
Applications of CBOW:
Language modeling: CBOW is used for predicting the next word in a sentence
or text. This is useful in speech recognition, machine translation, and other
language-related tasks.
Information retrieval: CBOW is used to generate word vectors that are used
in information retrieval systems to match queries with documents.
Text classification: CBOW is used to generate word vectors that are used in
text classification tasks, such as spam detection, topic classification, and
sentiment analysis.
Skip-gram (SG):
Skip-gram guesses the context words from a target word; this is the opposite
task to CBOW: you have to guess which words appear near a given word within a
fixed window size. In the example below, the skip-gram model predicts the words
surrounding the given word “jump” with a window size of 4.
The Skip-gram model is basically the inverse of the CBOW model. The input is
a centre word and the model predicts the context words.
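A small sketch of how (centre, context) pairs could be generated with a window size of 4; the sentence here is purely hypothetical, since the original example sentence around “jump” is not shown in the text.

```python
# Hypothetical sentence, used only for illustration.
sentence = "cats like to jump over the small fence".split()

def skipgram_pairs(tokens, window=4):
    # For each centre word, yield (centre, context) for every word within the window.
    pairs = []
    for pos, center in enumerate(tokens):
        for ctx in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
            if ctx != pos:
                pairs.append((center, tokens[ctx]))
    return pairs

# All context words the model should predict for the centre word "jump".
print([ctx for center, ctx in skipgram_pairs(sentence) if center == "jump"])
```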
In this section we will be implementing the multi-word Skip-gram
architecture of Word2Vec. As with single-word CBOW and multi-word CBOW, the
content is broken down into the following steps:
1. Data Preparation: Defining corpus by tokenizing text.
2. Generate Training Data: Build vocabulary of words, one-hot encoding for
words, word index.
3. Train Model: Pass one-hot encoded words through the forward pass, calculate
the error rate by computing the loss, and adjust the weights using backpropagation.
Output: Using the trained model, calculate word vectors and find similar words.
I will explain the Skip-gram steps without code, but if you want full working code
for Skip-gram with NumPy from scratch, I have a separate post for that; you can
always jump into it.
For example, for the context words “i” and “natural” the target word will be “like”.
For our example text, the full training data will look like:
As you can see, the table above is our final training data, where the encoded context
word is the Y variable for our model and the encoded target word is the X variable,
since Skip-gram predicts the surrounding words of a given word.
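A sketch of how this Skip-gram training data could be built, reusing the same assumed corpus as in the CBOW sketch; X holds the encoded centre (target) word and Y holds its two surrounding words (t-1 and t+1).

```python
import numpy as np

# Same assumed corpus and helpers as in the CBOW data-preparation sketch above.
tokens = "i like natural language processing".split()
vocab = sorted(set(tokens), key=tokens.index)
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Skip-gram: X is the encoded centre word, Y holds the encoded surrounding
# words, here the previous (t-1) and next (t+1) word.
X, Y = [], []
for pos in range(1, len(tokens) - 1):   # skip positions without both neighbours
    X.append(one_hot(tokens[pos]))
    Y.append([one_hot(tokens[pos - 1]), one_hot(tokens[pos + 1])])

X, Y = np.array(X), np.array(Y)
print(X.shape, Y.shape)   # (3, 5) (3, 2, 5), e.g. centre "like" with contexts "i" and "natural"
```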
Now we will move on to train our model as we are done with our final training
data.
Here we are trying to predict two surrounding words, which is why there are two
output layers. There can be any number of output layers, depending on how many
surrounding words you are trying to predict. In the picture above, the first output
layer represents the previous word (t-1) and the second output layer represents the
following word (t+1) of the given input word.
The number of nodes in each output layer (u11 to u15 and u21 to u25) is the same as
the number of nodes in the input layer (the count of unique vocabulary words, 5 in
our case). This is because each output layer produces a score for every word in the
vocabulary for its context position.
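A minimal NumPy training sketch for this two-output architecture, continuing from the data sketch above (it assumes X, Y, and word_to_index from that sketch). One simplifying assumption: the hidden-to-output weight matrix is shared across the two context positions, as in the standard skip-gram formulation, so both output layers reuse the same scores before the softmax, and their errors are summed during backpropagation.

```python
import numpy as np

np.random.seed(0)
vocab_size, embedding_dim = 5, 3

W1 = np.random.randn(vocab_size, embedding_dim) * 0.1   # input -> hidden
W2 = np.random.randn(embedding_dim, vocab_size) * 0.1   # hidden -> output (shared by both output layers)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

learning_rate = 0.05
for epoch in range(1000):
    for x, contexts in zip(X, Y):     # X, Y from the Skip-gram data sketch above
        h = W1.T @ x                  # hidden layer: the centre word's vector
        u = W2.T @ h                  # scores, reused for both output layers
        y_pred = softmax(u)

        # Sum the prediction errors over the two context positions (t-1 and t+1).
        e = np.sum([y_pred - y for y in contexts], axis=0)

        dW2 = np.outer(h, e)
        dW1 = np.outer(x, W2 @ e)
        W1 -= learning_rate * dW1
        W2 -= learning_rate * dW2

# Each row of W1 is now the learned 3-dimensional embedding of one vocabulary word.
print(W1[word_to_index["like"]])   # e.g. the vector for the centre word "like"
```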
Applications of Skip-gram: