NLP Notes

C5: Word Embeddings and Distance Measurements for Text

The relationship between the word and its neighbourhood tends to define the semantics of a
word and its overall positioning and presence in a sentence.
Word embedding is a learned representation of a word wherein each word is represented
using a vector in n-dimensional space.
Word2vec captures semantic relationships in text. For example:

vector(Man) + vector(Queen) ≈ vector(King) + vector(Woman)

or, equivalently, vector(King) - vector(Man) + vector(Woman) ≈ vector(Queen).
The thought process here is that the relationship of Man:King is the same as Woman:Queen,
and the Word2vec algorithm is able to capture these semantic relationships.
Note that the vectors obtained from this simple arithmetic are not exactly equal to the
actual vector representations of the words; the analogy holds only approximately.
Word2Vec is a model that enables the building of word vectors using contextual information
from the neighbourhood of a word. Every word's embedding is based on the words around it.
Word2vec is an unsupervised methodology for building word embeddings. In the Word2vec
architecture, an attempt is made to do either of the following:

 Predict the target word based on the context word
 Predict the context word based on the target word
The Word2vec algorithm tries to capture relationships between words in the text corpus.
The output of the Word2vec algorithm is a |V| * D matrix, where |V| is the size of the
vocabulary we want vector representations for and D is the number of dimensions used to
represent each word vector.
There is a pretrained, publicly available Word2vec model that Google trained on the Google
News dataset. It has a vocabulary of 3 million words and phrases, and each vector has
300 dimensions.
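As a quick sketch (assuming the gensim library and its downloader module; the model name and API below are gensim's, not part of the original notes), we can load this pretrained model and query the analogy discussed above:

import gensim.downloader as api

# Downloads the pretrained vectors on first use (~1.6 GB); returns a KeyedVectors object.
model = api.load("word2vec-google-news-300")

# The Man:King :: Woman:Queen analogy expressed as vector arithmetic.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" is expected to rank near the top, though only approximately.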
Word2vec models can be trained using either of two approaches:

 Predicting the context word using the target word as input, which is referred to as the
Skip-gram method
 Predicting the target word using the context words as input, which is referred to as the
Continuous Bag-of-Words (CBOW) method
Skip-gram Method

Say we have the sentence “Let us make an effort to understand natural language processing.”
In the corresponding figure, every row has one word shaded in brown. This word represents
the target word. Each row also has some words shaded in gray. These words represent the
context words for the corresponding target word. As you will have guessed, the
window_size value used here is 5.
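To make the windowing concrete, here is a small illustrative sketch in plain Python; window_size here is taken to mean the number of context words on each side of the target (conventions vary):

def skipgram_pairs(tokens, window_size=2):
    # Generate (target, context) training pairs for the Skip-gram method.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "Let us make an effort to understand natural language processing".split()
print(skipgram_pairs(sentence, window_size=2)[:6])
# [('Let', 'us'), ('Let', 'make'), ('us', 'Let'), ('us', 'make'), ('us', 'an'), ('make', 'Let')]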

CBOW Method

The CBOW method works similarly to the Skip-gram method. However, the change here is
that the vectors corresponding to the context words are sent in as input and the model tries
to predict the target word.

The methods we discussed previously are computationally expensive, since all the weights or
entries in the embedding and context matrices are updated for each (target word, context
word) pair. Mikolov et al. addressed this problem by employing two strategies: subsampling
and negative sampling.
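As a sketch of how these knobs appear in practice (assuming gensim; its sample and negative parameters control subsampling and negative sampling, respectively):

from gensim.models import Word2Vec

sentences = [
    ["let", "us", "make", "an", "effort", "to", "understand", "nlp"],
    ["word", "embeddings", "capture", "semantic", "relationships"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # D: the number of dimensions per word vector
    window=5,         # context window size
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    negative=5,       # negative samples drawn per positive pair
    sample=1e-3,      # subsampling threshold for frequent words
    min_count=1,
)
print(model.wv["nlp"].shape)  # (100,)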

The size of the vocabulary is equal to the number of unique words in the sentences we have
defined.
Higher-dimensional vectors capture more information across dimensions, especially when
the corpus and vocabulary are big and the data is highly varied.

“The model is as good as the data it was trained on.”

Word Mover's Distance


WMD computes the pairwise Euclidean distance between words across two sentences, and it
defines the distance between two documents as the minimum cumulative cost, in terms of
Euclidean distance, required to move all the words of the first sentence to the second
sentence.
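A sketch using gensim, whose KeyedVectors expose a wmdistance() method (it requires an optimal-transport backend such as POT or pyemd, depending on the gensim version):

import gensim.downloader as api

model = api.load("word2vec-google-news-300")
sentence_1 = "obama speaks to the media in illinois".split()
sentence_2 = "the president greets the press in chicago".split()
print(model.wmdistance(sentence_1, sentence_2))
# A smaller value means the two documents are semantically closer.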

C6: Exploring Sentence-, Document-, and Character-Level Embeddings

Similar to Word2Vec, the idea here is to predict certain words as well. However, in addition
to using word representations for predicting words, as we did in the Word2Vec model,
here, document representations are used as well.

These documents are represented using dense vectors, similar to how we represent words.
The vectors are called document or paragraph vectors and are trained to predict words in
the document.
Similar to Word2Vec, Doc2Vec also falls under the class of unsupervised algorithms since
the data that's used here is unlabeled.
The paper described two ways of building paragraph vectors, as follows:

 Distributed Memory Model of Paragraph Vectors (PV-DM): This is similar to the
continuous bag-of-words approach we discussed regarding Word2Vec. Paragraph vectors
are concatenated with the word vectors to predict the target word.
 Distributed Bag-of-Words Model of Paragraph Vectors (PV-DBOW): In this approach,
word vectors aren't taken into account. Instead, the paragraph vector is used to predict
randomly sampled words from the paragraph.

The PV-DBOW model is simpler and more memory-efficient, as word vectors don't need to
be stored in this approach.

The learned representations that are obtained from both the distributed memory model and
the distributed bag-of-words model can be combined to form the paragraph vector.
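A minimal Doc2Vec sketch (assuming gensim; dm=1 selects PV-DM and dm=0 selects PV-DBOW):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "let us make an effort to understand nlp".split(),
    "document vectors are trained to predict words".split(),
]
# Each document carries a tag; the paragraph vector is learned per tag.
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(corpus)]

model = Doc2Vec(tagged, vector_size=50, dm=1, min_count=1, epochs=40)

# Infer a paragraph vector for an unseen document.
vector = model.infer_vector("understanding document embeddings".split())
print(vector.shape)  # (50,)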

fastText builds word representations using character n-grams from the words themselves.
The fastText model helps capture morphological information from sub-word representations.
fastText is also flexible in that it can provide embeddings for out-of-vocabulary words,
since embeddings are a result of sub-word representations.
The original fastText research paper extended the Skip-gram approach of Word2Vec, but
today both the Skip-gram and continuous bag-of-words approaches can be used. Since it is
based on sub-word representations, fastText can be applied to a plethora of problems, such
as spelling correction, autosuggestion, and so on.

fastText is a very convenient technique for building word representations using character-
level features. It outperformed Word2Vec since it incorporated internal word structure
information and associated it with morphological features, which are very important in
certain languages.

It also allows us to represent words not present in the original vocabulary.
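A minimal fastText sketch (assuming gensim; min_n and max_n set the character n-gram lengths used for the sub-words):

from gensim.models import FastText

sentences = [
    "fasttext builds vectors from character ngrams".split(),
    "subword information captures morphology".split(),
]
model = FastText(sentences, vector_size=50, window=3, min_n=3, max_n=6, min_count=1)

# An out-of-vocabulary word still gets a vector, composed from its n-grams.
print(model.wv["fasttexting"].shape)  # (50,)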

Sent2Vec combines the continuous bag-of-words approach we discussed regarding Word2Vec
with the fastText idea of using constituent n-grams, in order to build sentence embeddings.

Research has shown that Sent2Vec outperforms Doc2Vec in the majority of the tasks it
undertakes and that it is a better representation method for sentences or documents.

The Universal Sentence Encoder is a very recent technique, open sourced by Google, for
building sentence- or document-level embeddings.

The Universal Sentence Encoder (USE) is a model for fetching embeddings at the sentence
level. These models are trained using Wikipedia, web news, web question-answer pages,
and discussion forums.

Several models built using USE-based transfer learning have outperformed state-of-the-art
results in the recent past. USE can be used similarly to TF-IDF, Word2Vec, Doc2Vec,
fastText, and so on for fetching sentence-level embeddings.
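A minimal sketch (assuming tensorflow and tensorflow_hub are installed; the module handle below is the publicly listed TF Hub address for USE v4):

import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get an embedding.",
])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per sentence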

C7: Identifying Patterns in Text Using Machine Learning

Naive Bayes is a popular ML algorithm based on Bayes' theorem, which can be represented
as follows:

P(A|B) = P(B|A) * P(A) / P(B)

Here, A and B are events:

P(A|B) is the probability of A given B, while P(B|A) is the probability of B given A.
P(A) is the independent probability of A, while P(B) is the independent probability of B.

Naive Bayes assumes that all the features are independent of each other, so the joint
probability is simply the product of the individual probabilities. This assumption is naive
because it is almost always wrong. For example, an applicant with a high SAT score is more
likely to also have a high GPA, so these two events are not independent. However, the Naive
Bayes assumption has been proven to work well for classification problems.

Sentiment analysis, sometimes called opinion mining or polarity detection, refers to the set
of algorithms and techniques that are used to extract the polarity of a given document; that
is, it determines whether the sentiment of a document is positive, negative, or neutral.
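A minimal sentiment classification sketch with Naive Bayes (assuming scikit-learn; the tiny corpus below is made up purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["loved the movie", "what a great film", "terrible plot", "awful acting"]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words counts as features; MultinomialNB applies the naive independence assumption.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["great film"]))  # ['positive']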
SVM is a supervised ML algorithm that attempts to classify data within a dataset by
finding the optimal hyperplane that best segregates the classes.

Each data point in the dataset can be considered a vector in an N-dimensional plane, with
each dimension representing a feature of the data. SVM identifies the frontier data points
(the points closest to the opposing class), also known as support vectors, and then attempts
to find the boundary (also known as the hyperplane in the N-dimensional space) that is
farthest from the support vectors of each class.

Say we have a fruit basket with two types of fruit in it and we want to create an algorithm
that segregates them. We only have information about two features of the fruits, namely
their weight and radius. Therefore, we can abstract this problem as a linear algebra
problem, with each fruit representing a vector in a two-dimensional space. In order to
segregate the two types of fruit, we have to identify a hyperplane (in two dimensions, the
hyperplane is a line) whose equation can be represented as follows:

w1*x1 + w2*x2 + c = 0

Here, w1 and w2 are coefficients and c is a constant. The equation of the hyperplane in n
dimensions can be generalized as follows:

w1*x1 + w2*x2 + ... + wn*xn + c = 0

The algorithm creates a number of candidate hyperplanes and repeats this calculation to
identify the hyperplane that maximizes the distance to the support vectors of both classes.
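A sketch of the fruit example using scikit-learn's SVC (the weight and radius numbers are hypothetical):

import numpy as np
from sklearn.svm import SVC

# (weight in grams, radius in cm) for two types of fruit.
X = np.array([[150, 3.5], [160, 3.7], [155, 3.6],   # fruit type 0
              [300, 5.0], [310, 5.2], [295, 4.9]])  # fruit type 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for the maximum-margin separating line.
clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.coef_, clf.intercept_)  # the learned w1, w2 and c
print(clf.predict([[200, 4.0]]))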

Pickling in Python refers to serializing and deserializing Python object structures. In other
words, by using the pickle module, we can save the Python objects that are created as part of
model training for reuse.
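A minimal sketch of pickling a trained model, such as the classifier from the sketch above:

import pickle

# Save the trained model object to disk for reuse...
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# ...and load it back later without retraining.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)
print(restored.predict([[200, 4.0]]))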

C8: From Human Neurons to Artificial Neurons for Understanding Text

The evolution of high-end processors in the form of graphics processing units (GPUs) and
tensor processing units (TPUs) has supplemented the rise of neural network-based
applications by making it possible to perform the heavy calculations that are very commonly
encountered in any neural network.

Activation functions play a key role in transforming the system of linear equations to a
nonlinear construct (complex nonlinear decision boundaries).

Activation functions introduce nonlinearity into the network. Without nonlinearity, the
network would perform only linear mappings from input to output, which would be nothing
but a multivariate linear equation.

There are techniques for initializing weight matrices, such as Xavier initialization, that yield
better results than randomly initialized weight matrices.
Regularization prevents overfitting by adding a penalty term to the loss function. The
popular forms of regularization include L1 (Lasso), L2 (Ridge), and Elastic Net.

Dropout is another very commonly used and effective form of regularization that helps
prevent overfitting in neural networks.

In addition to L1, L2, Elastic Net, and dropout, another technique that helps prevent
overfitting is early stopping.

Keras provides the evaluate() API for measuring the performance of our model on test data
and the predict() API for making predictions on new data.
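A minimal Keras sketch tying these pieces together (assuming TensorFlow 2.x; the data is random and purely illustrative):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = Sequential([
    Input(shape=(20,)),
    Dense(64, activation="relu"),
    Dropout(0.5),                        # dropout regularization
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training once the validation loss stops improving.
stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[stop], verbose=0)

loss, acc = model.evaluate(X, y, verbose=0)  # evaluate() on held-out data
predictions = model.predict(X[:5])           # predict() on new data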

C9: Applying Convolutions to Text

CNNs try to capture the spatial relationships in data.

If we have an image of n x n and a filter of size f x f, then the output is a matrix of the
following dimensions:
(n - f + 1) * (n - f + 1)

The pooling operation helps in downsampling the data so that only relevant information is
preserved.

Another thing you may have realized is that even if the data shifts somewhat, pooling allows
us to capture the information we need, irrespective of where the feature is located in the data.
This property is referred to as spatial invariance.

Global max pooling comes into effect primarily for temporal data. The global max pooling
operation can replace the flatten step in neural networks.

For text data, we look at one-dimensional spatial relationships and leverage the Conv1D layer
for this purpose. This is similar to going through n-grams, wherein there would be overlaps in
consecutive n-gram windows.
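A sketch of a Conv1D text classifier (assuming TensorFlow 2.x / Keras and integer-encoded input sequences; vocab_size and max_len are hypothetical):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size, max_len = 10000, 100
model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, 128),
    Conv1D(filters=64, kernel_size=5, activation="relu"),  # ~5-gram features
    GlobalMaxPooling1D(),  # replaces Flatten for temporal data
    Dense(1, activation="sigmoid"),
])
model.summary()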

C10: Capturing Temporal Relationships in Text

With ANNs, we primarily saw that inputs are independent of one another. With
CNNs, we went one step further and tried to capture spatial relationships in the inputs by
trying to extract patterns across a set of tokens together.

Recurrent Neural Networks (RNNs) help us capture context and temporal relationships in
sequences.

Sentences can be thought of as combinations of words, such that words are spoken over
time in a sequential manner. It is essential to capture this temporal relationship in natural
language data.

With CNNs, we only looked at the immediate proximity of a word.


Every recurrent neuron takes in two inputs: one is the current or external input at that time
step, and the other is called a hidden state, which is basically the output from the previous
time step.

One thing to be careful about is that we should not think of these as n different neural
networks. Instead, each of them is a snapshot of the same FNN with parameters shared across
the time steps.

Forward propagation is pretty straightforward in an RNN: an input vector along with a
hidden state vector is taken as input at each time step to produce an output, which is further
used as the hidden state for the next time step.

One of the key concepts to understand in RNNs is the process of backpropagation through
time (BPTT).

With ANNs, we had one output for one input. For RNNs, each token is an input, but we need
not have one output per token; we can instead have a single output for a group of tokens.

As a result, parameters are shared across the time steps.

How do we backpropagate in this scenario?

Since the parameters are shared across the time steps, the gradient calculated at each of the
time steps would not only be dependent on the computations of the present time step but also
on the previous time steps. Essentially, this can be thought of as the same neurons firing
differently across various points in time.

Why did we sum up the weight corrections at each time step and apply them all at once
instead of making the corrections at each time step?

This is because, during the forward pass, the weights were the same at every time step for a
given input. If we computed the gradient at time step t and applied the changes to the
weights there and then, the weights at time step t-1 would be different and the error
calculation would be wrong, since during the forward pass we had the same weights at every
time step. If we had updated the weights at each time step, we would have penalized the
weights, while computing the gradient, for something they did not do at all.

Sequences need not always be at the word level. Characters can be used as input sequences as
well.
The exploding gradient problem occurs when large error gradients pile up and cause huge
updates to the weights in our network. On the other hand, when the values of these gradients
are too small, they effectively prevent the weights from getting updated in a network. This is
called the vanishing gradient problem.

One technique for preventing the exploding gradient problem is called gradient clipping.
As part of gradient clipping, the gradient is capped at a maximum value.
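In Keras, for instance, optimizers accept clipping arguments (a sketch, not the only way to clip gradients):

from tensorflow.keras.optimizers import SGD

# clipvalue caps each gradient element at the given maximum absolute value;
# clipnorm would instead rescale the whole gradient vector by its norm.
optimizer = SGD(learning_rate=0.01, clipvalue=0.5)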

Different flavors of RNN

RNNs come in four flavors: one-to-one, one-to-many, many-to-one, and many-to-many
(many-to-many RNNs can take two forms, depending on whether or not the size of the input
is equal to the size of the output).

Carrying relationships both ways using bidirectional RNNs:

Let's look at the following two sentences:

The boy named Harry became the greatest wizard.
The boy named Harry became a Duke: the Duke of Sussex.

The first sentence talks about the fictional character Harry Potter created by author J.K.
Rowling, whereas the second sentence talks about Prince Harry from the United Kingdom.
Until we arrive at the word Harry, both sentences are exactly the same: The boy named
Harry. Using a simple RNN, we cannot infer much about Harry from the words before its
occurrence. Only once we see the latter half of the sentence do we know who is being talked
about: the wizard or the prince. It would be good if an RNN architecture could carry
information from the end of the sentence as well, to help infer things at a given point in time.
Bidirectional RNNs help us in this situation.

Bidirectional RNNs are essentially two independent RNNs, such that one of them processes
the inputs in the correct time order, whereas the other processes the inputs in reverse time
order. The outputs of these two networks are concatenated at every time step. This formation
allows the network to have information from both directions at every time step.
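A minimal Keras sketch of a bidirectional RNN (assuming TensorFlow 2.x; concatenation is the default merge mode for the Bidirectional wrapper):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, SimpleRNN, Dense

model = Sequential([
    Input(shape=(None,)),  # variable-length integer sequences
    Embedding(10000, 64),
    Bidirectional(SimpleRNN(32)),  # output size 64 = 32 forward + 32 backward
    Dense(1, activation="sigmoid"),
])
model.summary()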

RNNs help capture sequential information and temporal relationships by combining the
previous output with the present input.

The major problem associated with RNNs is that they suffer in terms of capturing and
making sense of long-term dependencies.

LSTM cells use the concept of state, or memory, to retain long-term dependencies. At every
step, the cell decides what to keep in memory and what to discard. All of this is done using
gates.

The input to an LSTM cell, as with RNNs, is a concatenation of the input for that time step
and the output of the previous time step. These values are passed on to the gates in the
LSTM cell, each of which is nothing but an FNN along with some form of activation
function. These gates are referred to as the forget gate, the input gate, and the output gate.
The neural networks in each of these gates get trained and allow the signal to flow through
them into the memory in different amounts. They decide what information should be
remembered, forgotten, or discarded at each step.

Forget Gate

The forget gate's job is to decide how much of the information should be removed from
memory.

It is as important to understand what should be forgotten as it is to understand what should
be remembered. Think of the following example:

Leonardo is a good actor. He won at the Oscars. Brad is a good actor too.

Initially, our cell should remember that Leonardo is being talked about. However, as soon
as we arrive at the third sentence, it should remember that Brad is being talked about
and it should discard the information about Leonardo from its memory. Basically, our
network should have the ability to forget long-term dependencies as soon as new
dependencies worth remembering arrive in our data. Forget gates help us with exactly this
by making space for new dependencies.

The values from the forget gate are multiplied with the values in the memory cell in order
to maintain only relevant information from the past.

Input Gate

We should next understand what we need to remember and how much of it should be
remembered. This is exactly what the input gate does for us.

Think of the following example:

Ronaldo is a good football player. Messi is another good player.

As soon as we arrive at the second sentence, the forget gate will help us forget about
Ronaldo, but it is the job of the input gate to ensure that we now remember about Messi.

The input gate has two parts, which simultaneously help in figuring out what is to be
remembered and how much of it needs to be remembered. Let's understand the functioning
of the two parts next.

Part 1 of the input gate uses a sigmoid activation function to pinpoint which parts of the
input values need to be remembered, by creating a sort of mask with values between 0 and 1.
A value of 0 would indicate that nothing from the inputs of this state is worth remembering,
whereas a value of 1 would indicate that everything from this input state must be remembered.

Part 2 uses a tanh activation function to help figure out what relevant information from the
present state the memory cell can potentially be updated with. This part is also often
referred to as the candidate vector, since this vector holds the values that the memory cell
might get updated with. The output of this FNN ranges between -1 and 1.

An element-wise multiplication is performed between the outputs from part 1 and part 2.
Essentially, we use the values from part 1 to weigh how relevant the various components of
part 2 are. The resultant output is added to the memory vector, thus updating the information
in the memory cell.

Output Gate

The job of the output gate is to understand which bits of information in the current step
should be sent across as output from the cell.

There are two things that happen at this stage in the LSTM cell, as follows:
1. First, the output gate receives the input that was originally received by the LSTM cell,
and these inputs are applied to the FNN in the output gate. Thereafter, the sigmoid activation
function is applied to the computed values to bring the output into the range of 0 to 1.

2. Second, the memory at this juncture is already updated based on what should have been
forgotten and what should have been remembered from the computations performed at the
forget gate and input gate stages. This memory state is now passed through a tanh activation
function at this stage to bring the values between -1 and 1.

Finally, the tanh-applied values from memory along with the sigmoid-applied values from
the output gate are multiplied element-wise to get the final output from this LSTM cell in
the network. This value can be taken as output and can also be sent across as the hidden
state for the next LSTM time step.
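The gate mechanics described above can be summarized in a NumPy sketch of a single LSTM time step (the weight matrices and biases are assumed to be already-trained arrays of compatible shapes):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wg, bg, Wo, bo):
    z = np.concatenate([h_prev, x_t])  # previous hidden state + current input
    f = sigmoid(Wf @ z + bf)           # forget gate: what to drop from memory
    i = sigmoid(Wi @ z + bi)           # input gate, part 1: what to remember
    g = np.tanh(Wg @ z + bg)           # input gate, part 2: candidate vector
    o = sigmoid(Wo @ z + bo)           # output gate: what to emit
    c_t = f * c_prev + i * g           # updated memory cell
    h_t = o * np.tanh(c_t)             # output, also the next hidden state
    return h_t, c_t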

Backpropagation in LSTMs works similarly to how it does in RNNs. However, unlike vanilla
RNNs, LSTMs largely avoid the problem of vanishing or exploding gradients, wherein the
gradients become exceedingly small or large. This is primarily because of the memory
component we introduced in LSTMs.

GRUs

LSTMs are huge networks with a lot of parameters. Consequently, we need to update a lot
of parameters, which is highly computationally expensive.

GRUs use only two gates instead of the three we used in LSTMs. They combine the forget
gate and the candidate-choice part of the input gate into one gate, called the update gate.
The other gate is the reset gate, which decides how the memory should get updated with
the newly computed information. Based on the output of these two gates, it is decided what
to send across as the output from this cell and how the hidden state is to be updated. This is
done using something called a content state, which holds the new information. As a result,
the number of parameters in the network is drastically reduced.
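The parameter saving can be seen directly by comparing same-sized layers (a TensorFlow 2.x sketch; the count comes out at roughly a 3:4 ratio, matching GRU's three weight blocks versus LSTM's four):

import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 100))  # sequences of 100-dim vectors
lstm = tf.keras.layers.LSTM(64)
gru = tf.keras.layers.GRU(64)
lstm(inputs)  # calling each layer once builds its weights
gru(inputs)
print("LSTM params:", lstm.count_params())
print("GRU params: ", gru.count_params())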

Stacked LSTMs

Stacked LSTMs follow an architecture similar to deep RNNs. During the discussion on deep
RNNs, we mentioned that stacking RNN layers one above the other helps the network
capture highly complex patterns and relationships. The same idea is used when building
stacked LSTMs, which can help us capture highly complex patterns from data. Each LSTM
layer in a stacked LSTM model has its own gates and memory vector. Stacked LSTMs are
very expensive in terms of computational requirements.
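A minimal stacked LSTM sketch in Keras (return_sequences=True makes a layer emit an output at every time step so that the next LSTM layer can consume the full sequence):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

model = Sequential([
    Input(shape=(None,)),             # variable-length integer sequences
    Embedding(10000, 64),
    LSTM(64, return_sequences=True),  # pass the full sequence upward
    LSTM(32),                         # final layer returns only the last state
    Dense(1, activation="sigmoid"),
])
model.summary()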

C11: State of the Art in NLP
