NLP Using Deep Learning Handson

The document discusses various methods of representing words in numeric form, focusing on one-hot encoding and word embeddings such as the CBOW, Skip-gram, and GloVe models. It explains the process of learning word embeddings, including initializing weights, defining context and target words, and the concept of negative sampling to improve efficiency. Additionally, it covers data preprocessing techniques for text data, including tokenization and sequence padding for training models in NLP tasks.


One Hot Encoding

One naive way of representing a word in numeric format is by one-hot encoding.


As shown in the image, all the words in the vocabulary are first stored as a set, and each word is assigned a unique index.
Each word is then represented as a vector in which all the elements are zeros except the element at the word's index, which is equal to one.
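A minimal sketch of this in Python (the toy vocabulary and the helper one_hot() are illustrative, not part of the original hands-on):

import numpy as np

# toy vocabulary: assign a unique index to each word
vocab = ["the", "movie", "was", "awesome"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size):
    # all zeros except a single one at the word's index
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("movie", len(vocab)))   # [0. 1. 0. 0.]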

Word2Vec
Learning good word embeddings is of paramount importance in NLP.
A good word embedding is one that represents a word in a minimal vector space while preserving its semantics as well as its context in the language.
Each word embedding is a point in a vector space, and the transformation of words to their vector representations is called word2vec.

Learning Word Embeddings


There are several ways of arriving at word vectors using deep learning; some of the popular methods are:
Continuous Bag of Words (CBOW) model
Skip-gram model
GloVe model

CBOW
Consider the sentence: Judith feigned a forgotten wallet to evade paying for dinner, proving she had surpassed frugality and become parsimonious.
If you are not sure about the meaning of the final word, parsimonious, you tend to look at the words surrounding it to guess its meaning (or context).
By looking at the words evade, paying, and frugality (avoiding waste), you might guess that parsimonious means spending less.
CBOW learns the meaning of a given word (as a numeric vector) by looking at a fixed number of words in front of and behind the word of interest; in other words, it learns the context.
The main idea of the CBOW model is to predict a word given its context.

Skip Gram
The skip-gram model is quite different from the CBOW model.
The idea of the skip-gram model is to predict the context given the word.
For a given word, the skip-gram model tries to predict the most probable words that usually surround it.
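As a quick, illustrative gensim sketch (the toy corpus is made up; in gensim's Word2Vec() the sg flag selects the objective, sg=0 for CBOW and sg=1 for skip-gram):

from gensim.models import Word2Vec

corpus = [["you", "win", "or", "you", "die"],
          ["winter", "is", "coming"]]

# sg=0 -> CBOW: predict a word from its surrounding context
cbow_model = Word2Vec(corpus, min_count=1, window=2, sg=0)

# sg=1 -> skip-gram: predict the surrounding context from a word
skipgram_model = Word2Vec(corpus, min_count=1, window=2, sg=1)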

Global Vectors
The GloVe model is slightly different from the word2vec models.
The GloVe model learns word embeddings by looking at the number of times two words have appeared together, which is called their co-occurrence.
It tries to minimize the difference between the similarity of two word vectors and their co-occurrence value.
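A rough sketch of how such co-occurrence counts can be collected from a corpus (the corpus and window size are illustrative; training the GloVe objective on these counts is not shown here):

from collections import defaultdict

corpus = [["you", "win", "or", "you", "die"]]
window = 2

# co_occurrence[(w1, w2)] counts how often w2 appears within `window` words of w1
co_occurrence = defaultdict(int)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                co_occurrence[(word, sentence[j])] += 1

print(co_occurrence[("you", "win")])   # 2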

Initializing Word Embeddings


Before training any neural network, one always starts with initializing the weights
and bias parameters to random values.
Likewise, before learning the word embeddings, the embedding vectors for each of
the words in the vocabulary are initialized with random values.
The word embeddings look like the table shown above. This table is commonly known as a lookup table.
Each word embedding is of N dimensions; in the above table, N = 3.
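A minimal numpy sketch of such a randomly initialized lookup table with N = 3 (the vocabulary and random values are only placeholders):

import numpy as np

vocab = ["when", "you", "play", "a", "game", "of", "thrones", "win", "or", "die"]
word_to_index = {w: i for i, w in enumerate(vocab)}
N = 3   # embedding dimension, as in the table above

# lookup table of shape (vocab_size, N), initialized with random values
lookup_table = np.random.uniform(-1.0, 1.0, size=(len(vocab), N))

# the row at a word's index is that word's (initial) embedding
print(lookup_table[word_to_index["thrones"]])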

The Context and the Target


Before we employ a neural network to learn word embeddings, it is necessary to fix
the context and target words.
Consider the sentence
When you play a game of thrones you win, or you die.

Let's say that we are trying to learn the word embedding for the word thrones in
the above sentence.

Here the target word is thrones, i.e. the word for which we are trying to find the embedding.

The context words for the word thrones are game, of, you, win, i.e. the two words before and after the target word, provided the window size is 2.

In other words, the words which usually surround the target word become the context words.

Words sharing a similar context share a similar meaning (and similar word vector representations).

Sampling
Once we have initialized the lookup table, it's time to sample the target and context words as shown in the above image.
These are similar to the input features and target labels you see in supervised learning.
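Since the image is not reproduced here, the sketch below generates the same kind of (target, context) samples for the example sentence, assuming a window size of 2:

sentence = "when you play a game of thrones you win or you die".split()
window = 2

samples = []
for i, target in enumerate(sentence):
    # context = up to `window` words before and after the target
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            samples.append((target, sentence[j]))

# e.g. ('thrones', 'game'), ('thrones', 'of'), ('thrones', 'you'), ('thrones', 'win')
print([pair for pair in samples if pair[0] == "thrones"])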

The Skip Gram Model


The figure shows the architecture of the skip-gram model.
The embedding matrix is the lookup table whose weights are taken as the word vectors.

Extracting Word Embeddings


After several iterations of forward pass -> compute loss -> update weights, the updated weights of the embedding matrix in the initial hidden layer are the final word embeddings.
The weights corresponding to the softmax layer can be discarded.
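A minimal Keras sketch of this architecture and of pulling out the learned embeddings afterwards (the vocabulary size, embedding dimension, and dummy index pairs are placeholders, not the course's actual setup):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size = 50        # placeholder vocabulary size
embedding_dim = 3      # N-dimensional word vectors

model = Sequential()
# hidden layer: the embedding matrix acts as the lookup table
model.add(Embedding(vocab_size, embedding_dim, input_length=1))
model.add(Flatten())
# output layer: softmax over the vocabulary to predict a context word
model.add(Dense(vocab_size, activation="softmax"))
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# dummy (target index, context index) pairs, just to show the training call
targets = np.array([[6], [6], [6], [6]])
contexts = np.array([4, 5, 7, 8])
model.fit(targets, contexts, epochs=1, verbose=0)

# the updated embedding matrix weights are the word embeddings;
# the softmax layer's weights can be discarded
word_embeddings = model.layers[0].get_weights()[0]    # shape (vocab_size, embedding_dim)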

Skip-Gram Drawback
One main disadvantage of the skip-gram model is that for each training sample it sees, the model updates all the weights in every iteration.
If the vocabulary size is large, this can be computationally expensive.
To address this issue, there is another technique called negative sampling, where only a small fraction of the weights are updated for each training sample.

Negative Sampling
In negative sampling, we generate a set of positive and negative samples from the
available text as shown in the figure.
Each sample will have a binary target value that says whether the two words appear
in the context or not.
The number of negative samples K can be chosen arbitrarily. The larger the corpus, the smaller the value of K (usually around 5).
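A rough sketch of drawing such samples (the corpus, window size, and K are illustrative; the original word2vec implementation draws negatives from a smoothed unigram distribution rather than uniformly):

import random

sentence = "when you play a game of thrones you win or you die".split()
vocab = list(set(sentence))
window, K = 2, 2

samples = []   # (target, other_word, label): 1 = in context, 0 = negative sample
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j == i:
            continue
        samples.append((target, sentence[j], 1))            # positive sample
        for _ in range(K):
            # randomly drawn word, assumed here not to be a true context word
            samples.append((target, random.choice(vocab), 0))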

Negative Sampling Architecture


The figure shows the negative sampling model architecture.
The softmax layer of the skip-gram model is replaced by a sigmoid activation.
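A minimal Keras sketch of that idea, with two embedding lookups, a dot product, and a sigmoid output that predicts whether the pair is a true (word, context) pair (sizes are placeholders):

from keras.models import Model
from keras.layers import Input, Embedding, Dot, Flatten, Dense

vocab_size = 50       # placeholder
embedding_dim = 3

target_in = Input(shape=(1,))
context_in = Input(shape=(1,))

# shared lookup table for both words
embedding = Embedding(vocab_size, embedding_dim)
target_vec = embedding(target_in)      # shape (batch, 1, embedding_dim)
context_vec = embedding(context_in)

# similarity score of the two word vectors
score = Flatten()(Dot(axes=-1)([target_vec, context_vec]))

# sigmoid instead of softmax: does the pair occur in the same context?
output = Dense(1, activation="sigmoid")(score)

model = Model(inputs=[target_in, context_in], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="adam")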

-------------

Which of the following model predicts if a word is a context of another word or not? Negative sampling
Which of the following model tries to predict the context word based on the target?
Skip gram model

Which of the following model tries to predict the target word based on the context?
CBOW model

Similar words tend to have similar word embedding representations. Depends on the corpus that is used to train

Which of the following option/options is/are the advantages of learning word embeddings? All

Which of the following model needs fewer training samples to learn the word
embeddings? Negative sampling

Which of the following activations is used in the CBOW model in its final layer to
learn word embeddings? Softmax activation

For the window size two, what would be the maximum number of target words that can
be sampled for skip gram model? 4

Which of the following option is the drawback of representing text as one hot
encoding? No contextual relationship

Which layer of the Skip-gram model has an actual word embedding representation?
Hidden layer

---------------

Sample Code
from gensim.models import Word2Vec

# define training data
sentences = [['gensim', 'is', 'billed', 'as', 'a', 'natural', 'language',
              'processing', 'package'],
             ['but', 'it', 'is', 'practically', 'much', 'more', 'than', 'that'],
             ['It', 'is', 'a', 'leading', 'and', 'a', 'state', 'of', 'the', 'art',
              'package', 'for', 'processing', 'texts', 'working', 'with', 'word',
              'vector', 'models']]

# train the model
model = Word2Vec(sentences, min_count=1, size=10)

# summarize the loaded model
print(model)

# summarize the vocabulary
words = list(model.wv.vocab)
print(words)

# access the vector for one word
print(model.wv['gensim'])

---------------

Hands On - Analogy Completion


1)
import numpy as np

word_to_vec = dict()
glove_file = open(file_name)

# build a word -> vector dictionary from the GloVe file
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = np.array(records[1:], dtype='float32')
    word_to_vec[word] = vector_dimensions

glove_file.close()

2)
# cosine similarity between word vectors u and v (used as cosine_similarity(u, v) in step 3)
score = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

3)
best_word = None
w1 = word_to_vec[word_1]
w2 = word_to_vec[word_2]
w3 = word_to_vec[word_3]

words = word_to_vec.keys()
max_cosine_sim = -100

for w in words:
    if w in [word_1, word_2, word_3]:
        continue

    # word_1 : word_2 :: word_3 : w  ->  compare (w2 - w1) with (vector of w - w3)
    cosine_sim = cosine_similarity((w2 - w1), (word_to_vec[w] - w3))

    if cosine_sim > max_cosine_sim:
        max_cosine_sim = cosine_sim
        best_word = w

---------------

Which of the following algorithm takes into account the global context of the word
to generate word vectors? GloVe model

Which of the following metrics uses the dot product of two vectors to determine the
similarity? Cosine distance

Which of the following criteria is used by GloVe model to learn the word
embeddings? Reduce the difference between the similarity of two-word vectors and
their co-occurrence value

Which of the following values is passed to the sg parameter of gensim Word2Vec() to generate word vectors using the skip-gram model? 1

Which of the following values is passed to the sg parameter of gensim Word2Vec() to generate word vectors using the CBOW model? 0

Which of the following models uses a co-occurrence matrix to generate word vectors? GloVe model

Which of the following model learns the word embeddings based on the co-occurrence
of the words in the corpus? GloVe model

Which of the following function in Keras is used to add the embedding layer to the model? keras.layers.Embedding()

Which of the following is the constructor used in gensim to generate word vectors?
Word2Vec()

-----------------

Data Preprocessing
When it comes to text data, we first remove all kinds of stop words, if necessary, and then transform each character or word into a one-hot encoding.
The Keras framework has a built-in class called Tokenizer which performs implicit tokenization and indexing of each word in the document.
It also eliminates special characters in the document.
### collection of text (or corpus)
docs = ["not good", "climax was awesome !", "really liked the movie", "too
lengthy"]

from keras.preprocessing.text import Tokenizer

t = Tokenizer()

### Perform transformation
t.fit_on_texts(docs)

### Output the number of documents in the corpus
t.document_count

### Output the number of occurrences of each word across the documents
t.word_counts

### Output the dictionary having words as keys and their unique indices as values
t.word_index

### Output the dictionary having words as keys and the number of documents each word has appeared in as values
t.word_docs

Lookup Table
Now each word in the text data is replaced by its respective index.
To train the LSTM model, you will not directly input the word index to the LSTM network.
We first initialize a lookup table of shape (vocab_size, vector_length).
We do this with the Keras Embedding class as follows.
from keras.layers import Embedding

embedding_layer = Embedding(vocab_size, vector_length)

Transform Data
Once you have a unique index for each word in the corpus, the corpus has to be represented as an array of indices in place of words, as shown below.
word_to_id = {"the": 0, "awesome": 1, "movie": 2, "good": 3, "was": 4}

data = [["the movie was awesome"]]
transformed_data = [[0, 2, 4, 1]]
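With the Keras Tokenizer from the earlier section, the same transformation can be done with texts_to_sequences (note that Tokenizer assigns its own indices, starting from 1, so the exact numbers differ from the hand-built mapping above):

from keras.preprocessing.text import Tokenizer

docs = ["the movie was awesome"]

t = Tokenizer()
t.fit_on_texts(docs)

# each document becomes a list of word indices
print(t.word_index)                 # e.g. {'the': 1, 'movie': 2, 'was': 3, 'awesome': 4}
print(t.texts_to_sequences(docs))   # [[1, 2, 3, 4]]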

Sequence Padding
The length of a movie review is not fixed; it can be too short or too long.
The model may take a very long time to train if the text data is too long.
For all the reviews, we may consider only the first few words, say 500.
If the text is shorter than 500 words, we pad it with zeros at the beginning to make up the length to 500 words.
from keras.preprocessing import sequence
max_review_length = 500
sequence.pad_sequences(transformed_data, maxlen=max_review_length)

------------------

Hands On - Sentiment Classification

import numpy as np
from keras.datasets import imdb

vocab_size = 5000
# workaround so that imdb.load_data can read pickled arrays with newer numpy versions
np.load.__defaults__ = (None, True, True, 'ASCII')
(X_train, Y_train), (X_test, Y_test) = imdb.load_data(num_words=vocab_size)
np.load.__defaults__ = (None, False, True, 'ASCII')
word_to_id = imdb.get_word_index()

word_to_id["PAD"] = 0
word_to_id["START"] = 1
word_to_id["UNK"] = 2

from keras_preprocessing import sequence

X_train_pad = sequence.pad_sequences(X_train, maxlen=500, padding='pre', value=0)
X_test_pad = sequence.pad_sequences(X_test, maxlen=500, padding='pre', value=0)

from keras.models import Sequential
from keras.layers import Embedding
from keras.layers.core import Dense, Dropout
from keras.layers.recurrent import LSTM

embedding_vector_length = 32
model = Sequential()
model.add(Embedding(5000, embedding_vector_length))
model.add(LSTM(100))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train_pad, Y_train, validation_data=(X_test_pad, Y_test),
          epochs=3, batch_size=64)

-------------

What is meant by beam width in the Beam search algorithm? The maximum number of words to be sampled at a time by the decoder

What is the tradeoff between the greedy search and beam search algorithms? Not Computation time

The functionality of the encoder in an encoder-decoder network for machine translation is __________. Not To generate translated words

Which of the following constructors is used to tokenize and assign a unique index to words in Keras? keras.preprocessing.text.Tokenizer()
