NLP - Module 2
22AI632 RACHEL E C
BITM, BALLARI
MODULE – 2
NLP PIPELINE
Data Acquisition
Data is the heart of any ML system.
For this, we need labeled data: a collection of queries where each
one is tagged with the label we want the model to predict.
How can we get such data?
Use a public dataset
Scrape data
Product intervention
Data augmentation
Synonym replacement
Back translation
TF-IDF–based word replacement
Bigram flipping
Replacing entities
Adding noise to data
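As an illustration of the first augmentation technique above, here is a minimal sketch of synonym replacement using NLTK's WordNet interface; the function name synonym_replace and the choice of NLTK are assumptions made for this example, not part of the original material.

import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def synonym_replace(sentence, n=1):
    """Return a copy of sentence with up to n words swapped for a WordNet synonym."""
    words = sentence.split()
    positions = list(range(len(words)))
    random.shuffle(positions)
    replaced = 0
    for i in positions:
        # Collect synonyms of the word, excluding the word itself.
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(words[i])
            for lemma in synset.lemmas()
            if lemma.name().lower() != words[i].lower()
        }
        if synonyms:
            words[i] = random.choice(sorted(synonyms))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)

print(synonym_replace("The quick brown fox jumps over the lazy dog"))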
Text Extraction and Cleanup
Text extraction and cleanup refers to the process of extracting raw
text from the input data by removing all the other non-textual
information, such as markup, metadata, etc., and converting the text
to the required encoding format.
Text extraction is a standard data-wrangling step, and we don’t
usually employ any NLP-specific techniques during this process.
Clean text is human language rearranged into a format that
machine models can understand.
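As a small illustration of this cleanup step, here is a hedged sketch of extracting plain text from an HTML page using Beautiful Soup; the library choice, the sample markup, and the variable names are assumptions made for this example.

from bs4 import BeautifulSoup   # pip install beautifulsoup4

html = "<html><body><h1>Title</h1><p>Some <b>useful</b> text.</p></body></html>"

# Parse the markup and keep only the visible text, dropping tags and scripts.
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()                     # remove non-textual elements
clean_text = soup.get_text(separator=" ", strip=True)
print(clean_text)                       # -> "Title Some useful text."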
Step 5: Tokenization
Tokenization splits a whole sentence into individual words (tokens). A
simple separator, such as whitespace, can be used for this purpose.
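A minimal sketch of tokenization, first with a simple whitespace separator and then with NLTK's word_tokenize; the use of NLTK here is an assumption for illustration.

from nltk.tokenize import word_tokenize   # requires: nltk.download("punkt")

sentence = "Dog bites man."

# Simple separator: split on whitespace.
print(sentence.split())            # ['Dog', 'bites', 'man.']

# NLTK tokenizer: also separates punctuation into its own token.
print(word_tokenize(sentence))     # ['Dog', 'bites', 'man', '.']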
Step 7: Normalization
Normalization is a further cleaning step that maintains uniformity.
It brings all variants of a word under one roof by applying stemming and
lemmatization.
Stemming
Many variations of a word bring no new information and create
redundancy, ultimately introducing ambiguity when training machine
learning models for prediction;
for example, "He likes to walk" and "He likes walking".
Stemming reduces such variants to a common root (stem) by stripping
suffixes, so "walk" and "walking" map to the same token.
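A small sketch using NLTK's PorterStemmer (an assumed choice of stemmer) to show how such variants collapse onto one stem:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["walk", "walks", "walking", "walked"]:
    print(word, "->", stemmer.stem(word))
# All four variants reduce to the stem "walk".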
Lemmatization
Lemmatization performs normalization using vocabulary and
morphological analysis of words.
Lemmatization aims to remove inflectional endings only and to
return the base or dictionary form of a word, which is known as the
lemma.
In NLTK, lemmatization is built on WordNet's morphy function,
making it a vocabulary-aware operation for text analysis.
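A minimal sketch using NLTK's WordNetLemmatizer, which wraps WordNet's morphy function; the specific words and POS arguments shown are assumptions for this example.

from nltk.stem import WordNetLemmatizer   # requires: nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))              # 'car'   (default POS is noun)
print(lemmatizer.lemmatize("walking", pos="v"))  # 'walk'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'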
NLP PIPELINE
Text Representation
Feature extraction is an important step for any machine learning
problem.
How do we transform a given text into numerical form so that it can
be fed into NLP and ML algorithms?
In NLP parlance, this conversion of raw text to a suitable numerical
form is called text representation.
Feature representation is a common step in any ML project, whether
the data is text, images, videos, or speech.
However, feature representation for text is often much more involved
as compared to other formats of data.
An image is stored in a computer as a matrix of pixels, where each
cell [i, j] of the matrix represents pixel i, j of the image.
Approaches for representing text range from simple vectorization
schemes to state-of-the-art techniques.
These approaches are classified into four categories:
Basic vectorization approaches
Distributed representations
Universal language representation
Handcrafted features
Advantages
BoW is fairly simple to understand and implement.
The vector space resulting from the BoW scheme captures the
semantic similarity of documents.
We have a fixed-length encoding for any sentence of arbitrary
length.
Disadvantages
The size of the vector increases with the size of the vocabulary.
It does not capture the similarity between different words that
mean the same thing.
This representation does not have any way to handle out of
vocabulary words (i.e., new words that were not seen in the corpus
that was used to build the vectorizer).
As the name indicates, it is a “bag” of words—word order
information is lost in this representation.
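A minimal bag-of-words sketch using scikit-learn's CountVectorizer on the four toy sentences S1–S4 listed later in this section; the use of scikit-learn is an assumption made for illustration.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)      # document-term count matrix
print(vectorizer.get_feature_names_out())   # vocabulary: ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(bow.toarray())                        # one fixed-length count vector per sentence

Note that "Dog bites man" and "Man bites dog" produce identical vectors, which illustrates the loss of word-order information described above.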
Term Frequency - Inverse Document Frequency (TF-IDF) is a widely
used statistical method in natural language processing and
information retrieval.
It measures how important a term is within a document relative to a
collection of documents (i.e., relative to a corpus).
Words within a text document are transformed into importance
numbers by a text vectorization process.
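For reference, the standard textbook formulation (stated here as an assumption of the usual definition, not a formula from the original slides; implementations often smooth the IDF term) is:

\mathrm{TF}(t,d) = \frac{\text{count of } t \text{ in } d}{\text{total number of terms in } d},
\qquad
\mathrm{IDF}(t) = \log \frac{N}{n_t},
\qquad
\text{TF-IDF}(t,d) = \mathrm{TF}(t,d) \times \mathrm{IDF}(t)

where N is the total number of documents in the corpus and n_t is the number of documents containing the term t.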
S1 – Dog bites man
S2 – Man bites dog
S3 – Dog eats meat
S4 – Man eats food
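A hedged sketch of TF-IDF on this toy corpus using scikit-learn's TfidfVectorizer; scikit-learn applies a smoothed IDF and L2 normalization by default, so the exact weights differ slightly from the plain formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))   # rarer words such as "meat" and "food" get higher weights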
Different levels of analysis are required:
Morphological analysis – the deep linguistic analysis process that
determines the lexical and grammatical features of each token in
addition to its POS tag. The result of this analysis is a list of Universal
Features.
Distributional similarity
This is the idea that the meaning of a word can be understood from
the context in which the word appears. This is also known as
connotation: meaning is defined by context. This is opposed to
denotation: the literal meaning of any word.
Distributional hypothesis
In linguistics, this hypothesizes that words that occur in similar
contexts have similar meanings.
According to the distributional hypothesis, if two words frequently occur
in similar contexts, there must be a strong similarity between their
meanings. Their corresponding representation vectors must then also be
close to each other.
Distributional representation
This refers to representation schemes that are obtained based on the
distribution of words in the contexts in which they appear.
These schemes are based on the distributional hypothesis.
Distributed representation
It is based on the distributional hypothesis.
Distributed representation schemes significantly compress the
dimensionality. This results in vectors that are compact (i.e., low
dimensional) and dense (i.e., hardly any zeros).
Embedding
For the set of words in a corpus, an embedding is a mapping from the
vector space of a distributional representation to the vector space of a
distributed representation.
Vector semantics
This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus.
Word Embeddings
In 2013, a seminal work by Mikolov et al. showed that their neural
network–based word representation model, known as "Word2vec" and based
on "distributional similarity," can capture word analogy relationships
such as: King − Man + Woman ≈ Queen.
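A hedged sketch of checking such an analogy with pretrained Word2vec vectors through gensim's downloader API; the model name "word2vec-google-news-300" and its large download size are assumptions about which pretrained vectors are used.

import gensim.downloader as api

# Downloads the pretrained Google News Word2vec vectors (large, ~1.6 GB) on first use.
word_vectors = api.load("word2vec-google-news-300")

# King - Man + Woman ~= Queen
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected top result: ('queen', ...)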
CBOW and SkipGram
Both of these architectures are similar in many respects. We'll use the
sentence "The quick brown fox jumps over the lazy dog" as our toy
corpus.
CBOW
• The primary task is to build a language model that correctly predicts
the center word given the context words in which the center word
appears.
• The objective of a language model is to assign probabilities in such a
way that it gives high probability to “good” sentences and low
probabilities to “bad” sentences.
• By good, we mean sentences that are semantically and syntactically
correct.
• By bad, we mean sentences that are incorrect—semantically or
syntactically or both.
• For a sentence like "The cat jumped over the dog," it will try to assign
a probability close to 1.0, whereas for a sentence like "jumped over
the the cat dog," it tries to assign a probability close to 0.0.
• CBOW tries to learn a language model that tries to predict the
“center” word from the words in its context.
Skip Gram
In SkipGram, the task is to predict the context words from the
center word.
For our toy corpus with context size 2, using the center word
"jumps," we try to predict every word in its context—"brown," "fox,"
"over," and "the."
This constitutes one step. Skip Gram repeats this one step for every
word in the corpus as the center word.
To use both the CBOW and SkipGram algorithms in practice, there
are several available implementations that abstract the
mathematical details for us. One of the most commonly used
implementations is gensim.
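A minimal sketch of training both variants with gensim on the toy corpus; the parameter values (vector_size, window, min_count) are illustrative assumptions and the snippet assumes gensim 4.x.

from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence.
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=0 trains CBOW (predict the center word from its context);
# sg=1 trains SkipGram (predict the context words from the center word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["fox"].shape)              # (50,) dense vector for "fox"
print(skipgram.wv.most_similar("fox"))   # nearest neighbours in the toy embedding space

On such a tiny corpus the learned vectors are not meaningful; the snippet only shows how the two training modes are selected.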