
Word Embeddings

Word embeddings are a popular technique in natural language processing (NLP) and machine learning for representing words in a numerical form that can be easily processed by computers. They represent words as vectors, or arrays of numbers, that capture the meaning of those words based on their context.

The most common way to generate word embeddings is through training neural network models on large datasets of text. The basic idea is to train a model to predict a word based on the context in which it appears. This can be done in a variety of ways, such as predicting the next word in a sentence, or predicting a missing word in a sentence. The trained neural network then produces a set of numerical values for each word in the vocabulary, which can be used as the word embeddings.

The resulting word embeddings have many useful properties. One is that words
with similar meanings or connotations are often close to each other in the
vector space. This means that words like "happy" and "joyful" might be very
close in the vector space, while words like "happy" and "sad" would be far apart.
This can be useful for a wide range of NLP tasks, such as sentiment analysis,
language translation, and information retrieval.
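One way to see this in practice is to query a pre-trained embedding model. The sketch below assumes the gensim library and its downloadable "glove-wiki-gigaword-50" vectors (neither is part of these notes); exact similarity values depend on the model used.

```python
import gensim.downloader as api

# Download (on first use) and load 50-dimensional GloVe vectors trained on Wikipedia
vectors = api.load("glove-wiki-gigaword-50")

# Cosine similarity between word pairs: words used in similar contexts tend to score higher
print(vectors.similarity("happy", "joyful"))
print(vectors.similarity("happy", "sad"))

# Nearest neighbours of a word in the vector space
print(vectors.most_similar("happy", topn=5))
```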

Word embeddings have been used in a wide range of NLP applications, including language modeling, text classification, named entity recognition, and machine translation. They have also been used in many other fields, such as computer vision and speech recognition, to represent objects and sounds in a numerical form that can be easily processed by computers.

Consider the following similar sentences: "Have a good day" and "Have a great day". They hardly differ in meaning. If we construct an exhaustive vocabulary (let's call it V), it would be V = {Have, a, good, great, day}.
Now, let us create a one-hot encoded vector for each of these words in V. The length of our one-hot encoded vector would be equal to the size of V (= 5). We would have a vector of zeros except for the element at the index representing the corresponding word in the vocabulary. That particular element would be one. The encodings below explain this better.
Have = [1,0,0,0,0]`; a = [0,1,0,0,0]`; good = [0,0,1,0,0]`; great = [0,0,0,1,0]`; day = [0,0,0,0,1]` (` represents transpose)
If we try to visualize these encodings, we can think of a 5-dimensional space, where each word occupies one of the dimensions and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.
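As a quick illustration of the encoding just described, here is a minimal Python/NumPy sketch (the vocabulary order is the one listed above; the helper name is illustrative):

```python
import numpy as np

# Vocabulary from the example sentences "Have a good day" / "Have a great day"
V = ["Have", "a", "good", "great", "day"]

def one_hot(word, vocab):
    """Return a one-hot vector for `word` given an ordered vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("good", V))   # [0. 0. 1. 0. 0.]
print(one_hot("great", V))  # [0. 0. 0. 1. 0.]

# Every pair of distinct one-hot vectors has dot product 0, so "good" and "great"
# look exactly as unrelated as "day" and "Have" under this encoding.
print(np.dot(one_hot("good", V), one_hot("great", V)))  # 0.0
```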
Word2vec
Word2vec is a specific algorithm for generating word embeddings, developed by Tomas Mikolov and his team at Google. Word2vec is a type of neural network that takes in a large corpus of text data and generates word embeddings by predicting the probability of words appearing together in a given context.

Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:

 Continuous Bag-of-Words, or CBOW model.
 Continuous Skip-Gram model.
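Both models are also available off the shelf. As a hedged example, the gensim library (an external dependency, not part of these notes; gensim 4.x API) exposes the choice between them through its sg flag:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences
sentences = [["i", "like", "natural", "language", "processing"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

# Each word now has a dense vector of length vector_size
print(cbow_model.wv["natural"].shape)                  # (100,)
print(skipgram_model.wv.most_similar("natural", topn=2))
```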

Continuous Bag of Words (CBOW):

It attempts to guess the output (target word) from its neighboring words (context words). You can think of it as a fill-in-the-blank task, where you need to guess the word in place of the blank by observing the nearby words.

Continuous Bag of Words (CBOW) single-word model:

Here we implement the single-word CBOW architecture of Word2Vec. The content is broken down into the following steps:

Data Preparation: Define the corpus by tokenizing the text.
Generate Training Data: Build the vocabulary of words, one-hot encoding for words, word index.
Train Model: Pass one-hot encoded words through the forward pass, calculate the error rate by computing the loss, and adjust the weights using back propagation.
Output: Using the trained model, calculate word vectors and find similar words.
I will explain the CBOW steps without code, but if you want full working code for CBOW with NumPy from scratch, I have a separate post for that, which you can always jump into.

1. Data Preparation:

Let's say we have a text like below:

"i like natural language processing"

To keep it simple I have chosen a sentence without capitalization and punctuation. Also, I will not remove any stop words ("and", "the", etc.), but for a real-world implementation you should do lots of cleaning tasks like stop-word removal, replacing digits, removing punctuation, etc.

After pre-processing we will convert the text to a list of tokenized words:

["i", "like", "natural", "language", "processing"]
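As a sketch, the minimal pre-processing described above (lower-casing, keeping only letters, splitting on whitespace) might look like this; the helper name is illustrative:

```python
import re

text = "i like natural language processing"

def tokenize(text):
    """Lower-case, drop punctuation and digits, and split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and spaces only
    return text.split()

corpus = tokenize(text)
print(corpus)  # ['i', 'like', 'natural', 'language', 'processing']
```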
2. Generate training data:

Unique vocabulary: Find the unique vocabulary list. As we don't have any duplicate words in our example text, the unique vocabulary will be:

["i", "like", "natural", "language", "processing"]

Now, to prepare training data for the single-word CBOW model, we define the "target word" as the word which follows a given word in the text (which will be our "context word"). That means we will be predicting the next word for a given word.
Now let's construct our training examples: scanning through the text with a window gives us a context word and a target word. For example, for the context word "i" the target word will be "like". For our example text the full training data will look like:

context word -> target word
i -> like
like -> natural
natural -> language
language -> processing
One-hot encoding: We need to convert the text to one-hot encoding, as the algorithm can only understand numeric values.
For example, the encoded value of the word "i", which appears first in the vocabulary, will be the vector [1, 0, 0, 0, 0]. The word "like", which appears second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0].

So, as you can see, the table above is our final training data, where the encoded target word is the Y variable for our model and the encoded context word is the X variable for our model.
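A small sketch of this step in NumPy, building the encoded X (context) and Y (target) arrays for the single-word CBOW model (variable names are illustrative):

```python
import numpy as np

corpus = ["i", "like", "natural", "language", "processing"]
vocab = list(dict.fromkeys(corpus))               # unique words, first-occurrence order
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_index[word]] = 1.0
    return vec

# Context word = current word, target word = the word that follows it
X = np.array([one_hot(w) for w in corpus[:-1]])   # encoded context words
Y = np.array([one_hot(w) for w in corpus[1:]])    # encoded target words

print(X[0], "->", Y[0])  # [1. 0. 0. 0. 0.] -> [0. 1. 0. 0. 0.]  ("i" -> "like")
```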

Now we will move on to train our model.

3. Training Model:
So far, so good, right? Now we need to pass this data into a basic neural network with one hidden layer and train it. The only thing to note is that the desired vector dimension of any word will be the number of hidden nodes.

For this tutorial and demo purposes my desired vector dimension is 3. For example:

"i" => [0.001, 0.896, 0.763], so the number of hidden layer nodes will be 3.

Dimension (n): This is the dimension of the word embedding. You can treat each dimension as a feature or attribute of the word, such as organization, name, gender, etc. It can be 10, 20, 100, etc. Increasing the embedding dimension lets the model describe a word or token in more detail. Just as an example, Google's pre-trained word2vec vectors have a dimension of 300.
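To make the training step concrete, here is a minimal NumPy sketch of the single-word CBOW network described above: one hidden layer of n = 3 nodes, a softmax output over the vocabulary, and plain gradient descent. It is a teaching sketch, not an optimized implementation, and all names are illustrative:

```python
import numpy as np

np.random.seed(0)

corpus = ["i", "like", "natural", "language", "processing"]
vocab = list(dict.fromkeys(corpus))
V_size, n = len(vocab), 3            # vocabulary size, embedding dimension
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(V_size)
    vec[word_index[word]] = 1.0
    return vec

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# (context, target) pairs: each word predicts the word that follows it
pairs = list(zip(corpus[:-1], corpus[1:]))

lr, epochs = 0.05, 500
W1 = np.random.rand(V_size, n)       # input -> hidden weights (the embeddings)
W2 = np.random.rand(n, V_size)       # hidden -> output weights

for _ in range(epochs):
    for ctx, tgt in pairs:
        x, y = one_hot(ctx), one_hot(tgt)
        h = W1.T @ x                 # hidden activation (the row of W1 for the context word)
        y_pred = softmax(W2.T @ h)   # predicted probability of each vocabulary word
        e = y_pred - y               # output error (softmax + cross-entropy gradient)
        W2 -= lr * np.outer(h, e)            # update hidden -> output weights
        W1 -= lr * np.outer(x, W2 @ e)       # update input -> hidden weights

print(W1[word_index["i"]])           # the learned 3-dimensional vector for "i"
```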
Applications of CBOW

CBOW (Continuous Bag of Words) is a popular algorithm used in natural language processing (NLP) for word embedding. It is used to predict a target word from the context words surrounding it.

Here are some of the applications of CBOW:

Language modeling: CBOW is used for predicting the next word in a sentence
or text. This is useful in speech recognition, machine translation, and other
language-related tasks.

Information retrieval: CBOW is used to generate word vectors that are used
in information retrieval systems to match queries with documents.

Sentiment analysis: CBOW is used in sentiment analysis to determine the sentiment of a text by analyzing the context words around the target word.

Recommendation systems: CBOW can be used to generate word vectors that are used in recommendation systems to suggest similar products or services.

Text classification: CBOW is used to generate word vectors that are used in
text classification tasks, such as spam detection, topic classification, and
sentiment analysis.
Skip-gram (SG):
It guesses the context words from a target word. This is the complete opposite task of CBOW: you have to guess which set of words can appear near a given word, within a fixed window size. For example, given the word "jump", a skip-gram model with window size 4 predicts the words surrounding "jump".

The Skip-gram model is basically the inverse of the CBOW model. The input is a centre word and the model predicts the context words.
In this section we will be implementing the Skip-gram multi-word architecture of Word2Vec. Like the single-word CBOW and multi-word CBOW, the content is broken down into the following steps:
1. Data Preparation: Define the corpus by tokenizing the text.
2. Generate Training Data: Build the vocabulary of words, one-hot encoding for words, word index.
3. Train Model: Pass one-hot encoded words through the forward pass, calculate the error rate by computing the loss, and adjust the weights using back propagation.
Output: Using the trained model, calculate word vectors and find similar words.
I will explain the Skip-gram steps without code, but if you want full working code for Skip-Gram with NumPy from scratch, I have a separate post for that, which you can always jump into.

1. Data Preparation for Skip-gram model:

Let's say we have a text like below:
"i like natural language processing"
To keep it simple I have chosen a sentence without capitalization and punctuation. Also, I will not remove any stop words ("and", "the", etc.), but for a real-world implementation you should do lots of cleaning tasks like stop-word removal, replacing digits, removing punctuation, etc.
After pre-processing we will convert the text to a list of tokenized words.
Output:
["i", "like", "natural", "language", "processing"]
2. Generate training data for the Skip-gram model:
Unique vocabulary: Find the unique vocabulary list. Here we don't have any duplicate words in our example text, so the unique vocabulary will be:
["i", "like", "natural", "language", "processing"]
Now, to prepare training data for the multi-word Skip-gram model, we define the "context words" as the words which surround a given word in the text (which will be our "target word"). That means we will be predicting the surrounding words for a given word.
Now let's construct our training examples: scanning through the text with a window gives us a target word and its context words. For example, for the context words "i" and "natural" the target word will be "like". For our example text, with a window size of 1, the full training data will look like:

target word -> context words
i -> like
like -> i, natural
natural -> like, language
language -> natural, processing
processing -> language
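A small sketch of how these (target, context) pairs can be generated for window size 1 (names are illustrative):

```python
corpus = ["i", "like", "natural", "language", "processing"]
window = 1

# For each target (centre) word, collect the words within `window` positions on either side
pairs = []
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((target, corpus[j]))

for target, context in pairs:
    print(target, "->", context)
# i -> like
# like -> i
# like -> natural
# natural -> like
# natural -> language
# ...
```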

One-hot encoding: We need to convert the text into one-hot encoding, as the algorithm can only understand numeric values.
For example, the encoded value of the word "i", which appears first in the vocabulary, will be the vector [1, 0, 0, 0, 0]. The word "like", which appears second in the vocabulary, will be encoded as the vector [0, 1, 0, 0, 0].
So, in one-hot encoded form, this set of target-context pairs is our final training data, where the encoded context words are the Y variable for our model and the encoded target word is the X variable, as Skip-gram predicts the surrounding words of a given word.
Now we will move on to training our model, as we are done with our final training data.

3. Training Skip-gram Model:

Multi-word Skip-gram model with window size = 1

Model architecture of Skip-gram:

Here we are trying to predict two surrounding words, which is why the number of output layers is two. It can be any number, depending on how many surrounding words you are trying to predict. In the architecture, the 1st output layer represents the previous word (t-1) and the 2nd output layer represents the following word (t+1) of the given input word.
The number of nodes in each output layer (u11 to u15 and u21 to u25) is the same as the number of nodes in the input layer (the count of unique vocabulary words, 5 in our case). This is because each output layer produces a score for every word in the vocabulary for its context position.
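To make the architecture concrete, here is a minimal NumPy sketch of the multi-word Skip-gram model with window size 1: each training step predicts every context word of the centre word through the same hidden-to-output weights, and the output errors are summed before backpropagation. As with the CBOW sketch, this is illustrative rather than optimized, and all names are illustrative:

```python
import numpy as np

np.random.seed(0)

corpus = ["i", "like", "natural", "language", "processing"]
vocab = list(dict.fromkeys(corpus))
V_size, n, window = len(vocab), 3, 1
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(V_size)
    vec[word_index[word]] = 1.0
    return vec

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# (target word, [context words]) samples with window size 1
samples = []
for i, target in enumerate(corpus):
    context = [corpus[j] for j in range(max(0, i - window),
                                        min(len(corpus), i + window + 1)) if j != i]
    samples.append((target, context))

lr, epochs = 0.05, 500
W1 = np.random.rand(V_size, n)   # input -> hidden weights (the embeddings)
W2 = np.random.rand(n, V_size)   # hidden -> output weights, shared by all context positions

for _ in range(epochs):
    for target, context in samples:
        x = one_hot(target)
        h = W1.T @ x                     # hidden activation for the target word
        y_pred = softmax(W2.T @ h)
        # Sum the output error over every context position
        e = sum(y_pred - one_hot(c) for c in context)
        W2 -= lr * np.outer(h, e)
        W1 -= lr * np.outer(x, W2 @ e)

print(W1[word_index["natural"]])         # learned 3-dimensional vector for "natural"
```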

Applications of Skip-gram:

1. Information retrieval: Skip-gram can be used to generate word embeddings that are used in information retrieval systems to match queries with documents.

2. Recommendation systems: Skip-gram can be used to generate word embeddings that are used in recommendation systems to suggest similar products or services.

3. Text classification: Skip-gram can be used to generate word embeddings that are used in text classification tasks, such as spam detection, topic classification, and sentiment analysis.

4. Word sense disambiguation: Skip-gram can be used to generate word embeddings that are used in word sense disambiguation tasks, which involve identifying the correct meaning of a word in context.
