NLP Using Deep Learning Handson
Word2Vec
Learning good word embeddings is of paramount importance in NLP.
A good word embedding is one that represents a word in a minimal vector space while preserving its semantics and its context in the language.
Each word embedding is a point in a vector space, and the transformation of words into their vector representations is called Word2Vec.
CBOW
Consider the sentence: Judith feigned a forgotten wallet to evade paying for dinner, proving she had surpassed frugality and become parsimonious.
If you are not sure about the meaning of the final word, parsimonious, you tend to look at the words surrounding it to guess its meaning (or context).
By looking at the words evade, paying, and frugality (avoiding waste), you might infer that parsimonious means spending less.
CBOW learns the meaning of a given word (as a numeric vector) by looking at a fixed number of words before and after the word of interest; in other words, it learns the context.
The main idea of the CBOW model is to predict a word given its context, as sketched below.
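As a rough illustrative sketch (not part of the original lesson), the CBOW training samples for the example sentence with a window size of 2 can be generated like this; each sample pairs the surrounding context words with the centre word to be predicted:
# illustrative sketch: build (context -> target) CBOW samples with window size 2
sentence = "judith feigned a forgotten wallet to evade paying for dinner".split()
window = 2
cbow_samples = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window): i] + sentence[i + 1: i + 1 + window]
    cbow_samples.append((context, target))
print(cbow_samples[6])   # (['wallet', 'to', 'paying', 'for'], 'evade')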
Skip Gram
The skip-gram model is quite different from the CBOW model.
The idea of the skip-gram model is to predict the context given the word.
For a given word, the skip-gram model tries to predict the most probable words that usually surround it.
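Correspondingly, a minimal sketch of skip-gram sampling (again illustrative, not from the lesson) reverses the direction: each centre word is paired with every word in its window as a separate (target, context) training pair:
# illustrative sketch: build (target -> context) skip-gram pairs with window size 2
sentence = "judith feigned a forgotten wallet to evade paying for dinner".split()
window = 2
skipgram_pairs = []
for i, target in enumerate(sentence):
    for context in sentence[max(0, i - window): i] + sentence[i + 1: i + 1 + window]:
        skipgram_pairs.append((target, context))
print(skipgram_pairs[:4])   # [('judith', 'feigned'), ('judith', 'a'), ('feigned', 'judith'), ('feigned', 'a')]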
Global Vectors
The GloVe model is slightly different from the Word2Vec model.
GloVe learns word embeddings by looking at the number of times two words have appeared together, which we call co-occurrence.
It tries to minimize the difference between the similarity of two word vectors and their co-occurrence.
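This objective can be written down explicitly (the formula below comes from the original GloVe paper, not from the lesson text): GloVe minimizes
J = \sum_{i,j} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2
where X_{ij} is the number of times word j appears in the context of word i, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are biases, and f is a weighting function that down-weights very rare and very frequent co-occurrences.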
Let's say that we are trying to learn the word embedding for the word thrones in the example sentence.
Here the target word is thrones, i.e. the word for which we are trying to find the embedding.
The context words for thrones are game, of, you, win, i.e. the two words before and after the target word, provided the window size is 2.
In other words, the words that usually surround the target word become the context words.
Words sharing a similar context share a similar meaning (and similar word vector representations).
Sampling
Once we have initialized the lookup table, it's time to sample the target and context words.
These pairs are analogous to the input features and target labels you see in supervised learning; see the sketch below.
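A small sketch of what that looks like in practice, reusing the illustrative sentence and skipgram_pairs names from the skip-gram sketch above (all names here are illustrative, not from the lesson):
import numpy as np

vocab = sorted(set(sentence))
word_index = {w: i for i, w in enumerate(vocab)}
vocab_size, vector_length = len(vocab), 10

# randomly initialized lookup table, one row per word in the vocabulary
lookup_table = np.random.uniform(-1, 1, size=(vocab_size, vector_length))

# target word indices act as the input features, context word indices as the labels
X = np.array([word_index[target] for target, context in skipgram_pairs])
y = np.array([word_index[context] for target, context in skipgram_pairs])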
Skip-Gram Drawback
One main disadvantage of the skip-gram model is that, for each training sample it sees, the model updates all the weights in every iteration.
If the vocabulary size is large, this can be computationally expensive.
To address this issue, there is another technique called negative sampling, where only a small fraction of the weights are updated for each training sample.
Negative Sampling
In negative sampling, we generate a set of positive and negative samples from the available text.
Each sample has a binary target value that says whether the two words appear in the same context or not.
The number of negative samples K can be chosen arbitrarily: the larger the corpus, the smaller the value of K (usually around 5).
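Keras provides a helper that generates exactly such positive and negative (target, context) pairs; a minimal sketch follows (the encoded sentence values are made up for illustration):
from keras.preprocessing.sequence import skipgrams

# a sentence already encoded as word indices (illustrative values)
encoded = [1, 4, 2, 7, 3, 5]
# negative_samples is the ratio of negative to positive pairs (the K discussed above)
pairs, labels = skipgrams(encoded, vocabulary_size=10, window_size=2, negative_samples=1.0)
# each pair is (target, context); label 1 means the words co-occur, 0 means a negative sample
print(pairs[:3], labels[:3])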
-------------
Which of the following models tries to predict the target word based on the context? CBOW model
Do similar words tend to have similar word embedding representations? Depends on the corpus that is used to train
Which of the following models needs fewer training samples to learn the word embeddings? Negative sampling
Which of the following activations is used in the final layer of the CBOW model to learn word embeddings? Softmax activation
For a window size of two, what would be the maximum number of target words that can be sampled for the skip-gram model? 4
Which of the following options is a drawback of representing text as one-hot encoding? No contextual relationship
Which layer of the skip-gram model has the actual word embedding representation? Hidden layer
---------------
Sample Code
from gensim.models import Word2Vec
# corpus: a list of tokenized sentences
sentences = [["not", "good"], ["climax", "was", "awesome"],
             ["really", "liked", "the", "movie"], ["too", "lengthy"]]
# train the model (the parameter is named size in gensim < 4.0 and vector_size in gensim >= 4.0)
model = Word2Vec(sentences, min_count=1, vector_size=10)
# summarize the vocabulary (model.wv.vocab in gensim < 4.0)
words = list(model.wv.key_to_index)
print(words)
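Once trained, an individual (here 10-dimensional) word vector can be looked up from the model, for example:
# look up the learned vector for a word from the toy corpus
print(model.wv['movie'])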
---------------
glove_file.close()
2)
# cosine similarity between vectors u and v
score = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
3)
best_word = None
w1 = word_to_vec[word_1]
w2 = word_to_vec[word_2]
w3 = word_to_vec[word_3]
words = word_to_vec.keys()
max_cosine_sim = -100
for w in words:
    if w in [word_1, word_2, word_3]:
        continue
    # word analogy: pick the word w whose offset from w3 is most similar to (w2 - w1)
    cosine_sim = np.dot(w2 - w1, word_to_vec[w] - w3) / (
        np.linalg.norm(w2 - w1) * np.linalg.norm(word_to_vec[w] - w3))
    if cosine_sim > max_cosine_sim:
        max_cosine_sim = cosine_sim
        best_word = w
---------------
Which of the following algorithms takes into account the global context of a word to generate word vectors? GloVe model
Which of the following metrics uses the dot product of two vectors to determine the similarity? Cosine distance
Which of the following criteria is used by the GloVe model to learn the word embeddings? Reduce the difference between the similarity of two word vectors and their co-occurrence value
Which of the following models uses a co-occurrence matrix to generate word vectors? GloVe model
Which of the following models learns the word embeddings based on the co-occurrence of the words in the corpus? GloVe model
Which of the following functions in Keras is used to add an embedding layer to the model? keras.layers.Embedding()
Which of the following is the constructor used in gensim to generate word vectors? Word2Vec()
-----------------
Data Preprocessing
When it comes to text data, we first remove all kinds of stop words, if necessary, and then transform each character or word into a one-hot encoding.
The Keras framework has a built-in class called Tokenizer which performs implicit tokenization and indexing of each word in the document.
It also eliminates special characters in the document.
from keras.preprocessing.text import Tokenizer
### collection of text (or corpus)
docs = ["not good", "climax was awesome !", "really liked the movie", "too lengthy"]
t = Tokenizer()
### fit the tokenizer on the documents before querying it
t.fit_on_texts(docs)
### Output the dictionary having words as keys and their unique index as values
t.word_index
### Output the dictionary having words as keys and the number of documents they appear in as values
t.word_docs
Lookup Table
Now each word in the text data is replaced by its respective index.
To train the LSTM model, you do not directly input the word index to the LSTM network.
We first initialize a lookup table of shape (vocab_size, vector_length).
We do this with the Keras Embedding class as follows.
from keras.layers import Embedding
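# a minimal sketch, assuming vocab_size and vector_length are defined elsewhere:
# the Embedding layer holds a trainable lookup table of shape (vocab_size, vector_length)
embedding_layer = Embedding(input_dim=vocab_size, output_dim=vector_length)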
Transform Data
Once you have a unique index for each word in the corpus, the corpus has to be represented as an array of indices in place of words, as shown below.
word_to_id = { "the": 0, "awesome": 1, "movie": 2, "good": 3, "was": 4 }
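For example, a short review can be encoded with this mapping as follows (the review text here is illustrative):
review = "the movie was awesome"
transformed = [word_to_id[w] for w in review.split()]   # gives [0, 2, 4, 1]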
Sequence Padding
The length of a movie review is not fixed; it can be too short or too long.
The model may take a very long time to train if the text data is too long.
For all the reviews, we may consider only the first few words, say 500.
If the text is shorter than 500 words, we pad zeros at the beginning to make up the length to 500 words.
from keras.preprocessing import sequence
max_review_length = 500
# pad shorter reviews with zeros at the beginning so every review has length max_review_length
padded_data = sequence.pad_sequences(transformed_data, maxlen=max_review_length)
------------------
import numpy as np
from keras.datasets import imdb

vocab_size = 5000
# workaround: temporarily allow pickled objects while loading the dataset
np.load.__defaults__ = (None, True, True, 'ASCII')
(X_train, Y_train), (X_test, Y_test) = imdb.load_data(num_words=vocab_size)
np.load.__defaults__ = (None, False, True, 'ASCII')

word_to_id = imdb.get_word_index()
# reserve indices for the padding, start-of-sequence and unknown tokens
word_to_id["PAD"] = 0
word_to_id["START"] = 1
word_to_id["UNK"] = 2
-------------
What is meant by beam width in the Beam search algorithm? The maximum number of words to be sampled at a time by the decoder
What is the tradeoff between the greedy search and beam search algorithms? Not
Computation time