
SUBJECT CODE: 22AI632
BY: RACHEL E C, BITM, BALLARI
MODULE – 2
NLP PIPELINE
Data Acquisition
 Data is the heart of any ML system.
 For this, we need labeled data: a collection of texts (for example, user queries) where each one is labeled with the category it belongs to.
 How can we get such data?
 Use a public dataset
 Scrape data
 Product intervention
 Data augmentation (a short sketch of one such technique follows this list)
 Synonym replacement
 Back translation
 TF-IDF–based word replacement
 Bigram flipping
 Replacing entities
 Adding noise to data
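As a small illustration of one augmentation technique from the list above, here is a minimal synonym-replacement sketch using NLTK's WordNet (the helper function, sample sentence, and number of replacements are assumptions of this example, not part of the original material):

import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_replace(sentence, n_replacements=1):
    # Replace up to n words in the sentence with a randomly chosen WordNet synonym
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for idx in candidates[:n_replacements]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(words[idx]) for lemma in syn.lemmas()}
        synonyms.discard(words[idx])
        if synonyms:
            words[idx] = random.choice(sorted(synonyms))
    return " ".join(words)

print(synonym_replace("the delivery was quick and the food was good"))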
Text Extraction and Cleanup
 Text extraction and cleanup refers to the process of extracting raw
text from the input data by removing all the other non-textual
information, such as markup, metadata, etc., and converting the text
to the required encoding format.
 Text extraction is a standard data-wrangling step, and we don’t
usually employ any NLP-specific techniques during this process.
 Clean text is human language rearranged into a format that
machine models can understand.

Step 1: Lowercase / Uppercase

Converting all text to a single case (usually lowercase) helps maintain consistency during NLP tasks and text mining. Python's built-in lower() string method makes this step straightforward.
Step 2: Punctuation Removal
In this step we remove all punctuation, because punctuation adds noise to the sentence and introduces ambiguity while training the model.
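A minimal sketch of Steps 1 and 2 combined (the sample sentence is only an assumed example):

import string

text = "Hello, World!! NLP is Fun..."
text = text.lower()                                               # Step 1: lowercase
text = text.translate(str.maketrans("", "", string.punctuation))  # Step 2: remove punctuation
print(text)   # hello world nlp is fun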

Step 3: HTML Code and URL Links

To strip URLs and HTML tags, we can simply use the following code:

import re
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)  # remove URLs
text = re.sub(r'<.*?>+', '', text)                                     # remove HTML tags

Step 4: Spell Checks


Incoming text data often has spelling errors. This can be prevalent in
search engines, text-based chatbots deployed on mobile devices,
social media, and many other sources.
Shorthand typing: Hllo world! I am back!
Fat finger problem [20]: I pronise that I will not bresk the silence
again!
System-Specific Error Correction
 Text extraction from scanned documents is typically done through optical character recognition (OCR).
 One approach is to run the OCR output through a spell checker such as pyenchant (see the sketch below), which will identify misspellings and suggest some alternatives.
 Another approach is to use neural network architectures to train word/character-based language models, which are in turn used to correct the OCR output based on the surrounding context.
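A minimal sketch of the pyenchant approach mentioned above (the dictionary tag "en_US" and the sample word are assumptions of the example):

import enchant

d = enchant.Dict("en_US")      # load an English dictionary
word = "pronise"
if not d.check(word):          # flag the misspelling
    print(d.suggest(word))     # suggested alternatives, e.g. "promise", "premise", ...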
Pre-Processing

Step 5: Tokenization
Tokenization splits a whole sentence into its constituent words (tokens). In the simplest case, a separator such as whitespace can be used for this purpose.
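A minimal sketch, first with a simple whitespace separator and then with NLTK's word_tokenize (assuming the punkt tokenizer data has been downloaded):

import nltk
nltk.download("punkt", quiet=True)

text = "Dog bites man."
print(text.split())               # ['Dog', 'bites', 'man.']
print(nltk.word_tokenize(text))   # ['Dog', 'bites', 'man', '.']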

Step 6: Removing Stop Words

Stop words are the most commonly used words in a language; for instance, "a," "our," "for," and "in." Removing these words helps the model to consider only the key features.
These words also don't carry much information, so by eliminating them data scientists can focus on the important words.
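A minimal stop-word-removal sketch using NLTK's English stop word list (an assumed choice; spaCy or a custom list would work equally well):

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
tokens = ["a", "dog", "bites", "a", "man", "in", "our", "street"]
print([t for t in tokens if t not in stop_words])   # ['dog', 'bites', 'man', 'street']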

Step 7: Normalization
Normalization is a further cleaning step used to maintain uniformity. It brings the different surface forms of a word under one roof through stemming and lemmatization.
Stemming
Stemming reduces a word to its stem by chopping off affixes. Many variations of a word do not bring any new information and create redundancy, ultimately introducing ambiguity when training machine learning models; for example, "He likes to walk" and "He likes walking" convey the same meaning.
Lemmatization
Lemmatization performs normalization using a vocabulary and morphological analysis of words.
Lemmatization aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
In NLTK, the lemmatizer is built on WordNet's morphy function, making it a linguistically informed operation for text analysis.
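A minimal sketch contrasting stemming and lemmatization with NLTK (the example words are assumptions of the sketch):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("walking"))                  # 'walk'
print(stemmer.stem("studies"))                  # 'studi'  (stems need not be real words)
print(lemmatizer.lemmatize("studies"))          # 'study'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'   (uses WordNet and the POS hint)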
NLP PIPELINE
Text Representation
 Feature extraction is an important step for any machine learning
problem.
 How do we transform a given text into numerical form so that it can
be fed into NLP and ML algorithms?
 In NLP parlance, this conversion of raw text to a suitable numerical
form is called text representation.
 Feature representation is a common step in any ML project, whether
the data is text, images, videos, or speech.
 However, feature representation for text is often much more involved
as compared to other formats of data.
 The way an image is stored in a computer is in the form of a matrix of
pixels where each cell[i,j] in the matrix represents pixel i,j of the image.
We will look at approaches ranging from simple methods to state-of-the-art techniques for representing text.
These approaches are classified into four categories:
 Basic vectorization approaches
 Distributed representations
 Universal language representation
 Handcrafted features

 Suppose we are given a labeled text corpus and asked to build a sentiment analysis model.
 To correctly predict the sentiment of a sentence, the model needs to understand the meaning of the sentence.
In order to correctly extract the meaning of the sentence, the most
crucial data points are:
1. Break the sentence into lexical units such as lexemes, words, and
phrases
2. Derive the meaning for each of the lexical units
3. Understand the syntactic (grammatical) structure of the sentence.
4. Understand the context in which the sentence appears

 The semantics (meaning) of the sentence arises from the combination of the above points.
 Thus, any good text representation scheme must facilitate the extraction of those data points in the best possible way to reflect the linguistic properties of the text.
Vector Space Models
 Text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents) are represented with vectors of numbers.
 This is known as the vector space model (VSM)
 It’s a simple algebraic model used extensively for representing any
text blob.
 VSM is fundamental to many information-retrieval operations,
from scoring documents on a query to document classification and
document clustering.
 It’s a mathematical model that represents text units as vectors.
 In the simplest form, these are vectors of identifiers, such as index
numbers in a corpus vocabulary.
 The most common way to calculate similarity between two text
blobs is using cosine similarity: the cosine of the angle between their
corresponding vectors.
 The cosine of 0° is 1 and the cosine of 180° is –1, with the cosine
monotonically decreasing from 0° to 180°.
Given two vectors, A and B, each with n components, the similarity between them is computed as follows:

similarity(A, B) = cos(θ) = (A · B) / (|A| |B|) = (Σi Ai·Bi) / ( √(Σi Ai²) · √(Σi Bi²) )

where Ai and Bi are the i-th components of vectors A and B, respectively. Sometimes, people also use the Euclidean distance between vectors to capture similarity.
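A minimal sketch of the cosine similarity computation with NumPy (the two example vectors are assumed purely for illustration):

import numpy as np

A = np.array([1, 1, 1, 0, 0, 0])
B = np.array([1, 0, 0, 1, 0, 1])
cos_sim = float(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)))
print(round(cos_sim, 2))   # 0.33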
Basic Vectorization Approaches
Let’s start with a basic idea of text representation: map each word in
the vocabulary (V) of the text corpus to a unique ID (integer value),
then represent each sentence or document in the corpus as a V-
dimensional vector.
D1 – Dog bites man
D2 – Man bites dog
D3 – Dog eats meat
D4 – Man eats food
We first map each of the six words to unique IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6. Let's consider the document D1: "dog bites man". As per the scheme, each word is a six-dimensional vector. Dog is represented as [1 0 0 0 0 0], as the word "dog" is mapped to ID 1. Bites is represented as [0 1 0 0 0 0], and so on and so forth. Thus, D1 is represented as [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]. D4 is represented as [[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]]. This scheme is known as one-hot encoding.
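A minimal sketch of this one-hot scheme for the toy corpus (the helper function is an assumption of the example; note that the IDs below follow alphabetical order rather than the mapping used on the slide):

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
vocab = {w: i for i, w in enumerate(sorted({w for doc in corpus for w in doc.split()}))}

def one_hot(doc):
    # One |V|-dimensional vector per word in the document
    vectors = []
    for word in doc.split():
        vec = [0] * len(vocab)
        vec[vocab[word]] = 1
        vectors.append(vec)
    return vectors

print(one_hot("dog bites man"))
# [[0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]]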
Shortcomings
1. Sparsity and overfitting – the size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies, which makes the representation very sparse and high dimensional.
2. This representation does not give a fixed-length representation for text, whereas most learning algorithms need feature vectors of the same length.
3. It treats words as atomic units and has no notion of (dis)similarity between words.
4. Out-of-vocabulary (OOV) problem – for a new sentence such as "man eats fruits", the word "fruits" cannot be represented because it does not appear in the training data.
Bag of Words
 Bag of words (BoW) is a classical text representation technique that
has been used commonly in NLP, especially in text classification
problems.
 The key idea behind it is as follows: represent the text under
consideration as a bag (collection) of words while ignoring the order
and context.
 The basic intuition behind it is that it assumes that the text
belonging to a given class in the dataset is characterized by a unique
set of words.
 If two text pieces have nearly the same words, then they belong to
the same bag (class).
 Thus, by analyzing the words present in a piece of text, one can
identify the class (bag) it belongs to.
D1 – Dog bites man
D2 – Man bites dog
D3 – Dog eats meat
D4 – Man eats food
where the word IDs are dog = 1, bites = 2, man = 3, meat = 4 , food =
5, eats = 6, D1 becomes [1 1 1 0 0 0]. This is because the first three
words in the vocabulary appeared exactly once in D1, and the last
three did not appear at all. D4 becomes [0 0 1 0 1 1].
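A minimal BoW sketch using scikit-learn's CountVectorizer (the choice of library is an assumption; any equivalent implementation works):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bow = CountVectorizer()
X = bow.fit_transform(corpus)            # document-term count matrix
print(bow.get_feature_names_out())       # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray()[0])                    # D1 -> [1 1 0 0 1 0]  (columns in alphabetical order)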

Advantages
 BoW is fairly simple to understand and implement.
 The vector space resulting from the BoW scheme captures the
semantic similarity of documents.
 We have a fixed-length encoding for any sentence of arbitrary
length.
Disadvantages
 The size of the vector increases with the size of the vocabulary.
 It does not capture the similarity between different words that
mean the same thing.
 This representation does not have any way to handle out-of-vocabulary words (i.e., new words that were not seen in the corpus that was used to build the vectorizer).
 As the name indicates, it is a "bag" of words: word order information is lost in this representation.
TF-IDF
 Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval.
 It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus): term frequency (TF) counts how often the term occurs in the document, while inverse document frequency (IDF) down-weights terms that appear in many documents of the corpus.
 Words within a text document are transformed into importance numbers (TF × IDF weights) by this text vectorization process.
S1 – Dog bites man
S2 – Man bites dog
S3 – Dog eats meat
S4 – Man eats food
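A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer on the four sentences above (the library choice is an assumption of the example):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())     # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray().round(2)[0])           # TF-IDF weights for S1; rarer terms get larger weights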
Different levels of analysis required:

Morphological analysis – the deep linguistic analysis process that determines the lexical and grammatical features of each token in addition to its POS. The result of this analysis is a list of universal features.

Syntactic analysis – also called syntax analysis or parsing; the process of analyzing natural language text with the rules of a formal grammar.

Semantic analysis – a crucial part of NLP that concentrates on understanding the meaning, interpretation, and relationships between words, phrases, and sentences in a given context.

Discourse analysis – extracting meaning from a corpus or text beyond the sentence level; it involves interpreting the text along with its social context, and it helps in training NLP models better.
Bag of N-Grams

N-grams are contiguous sequences of words, symbols, or tokens in a document, i.e., the neighboring sequences of items in the document.
 They are used most importantly in tasks dealing with text data in
NLP (Natural Language Processing).
 Given a sequence of N-1 words, an N-gram model predicts the
most probable word that might follow this sequence.
 A model that simply relies on how often a word occurs without
looking at previous words is called unigram.
 If a model considers only the previous word to predict the
current word, then it's called bigram.
 If two previous words are considered, then it's a trigram model.
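A bag-of-n-grams representation can be sketched with scikit-learn's CountVectorizer by setting ngram_range (an assumed choice of library, shown here for unigrams plus bigrams):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bon = CountVectorizer(ngram_range=(1, 2))    # unigrams and bigrams
X = bon.fit_transform(corpus)
print(bon.get_feature_names_out())           # e.g. 'bites', 'bites dog', 'bites man', 'dog', 'dog bites', ...
print(X.toarray()[0])                        # counts of each unigram/bigram in D1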
A statistical language model is a probabilistic model that assigns a probability to a sequence of words.
It is able to predict the next word in a sequence given a history context represented by the preceding words.
The probability that we want to model can be factorized using the chain rule as follows:

P(w1, w2, …, wn) = P(w1 | w0) · P(w2 | w0, w1) · … · P(wn | w0, w1, …, w(n−1))

where w0 is a special token (written <s> below) that denotes the start of the sentence.
In practice, we usually use what are called N-gram models, which make a Markov assumption to limit the history context: each word is conditioned only on the previous N−1 words. An example:
Training Set:
The Arabian Knights
These are the fairy tales of the east
The stories of the Arabian knights are translated in
many language.
Bigram model:
P(the/<s>) = 2/3 ≈ 0.67    P(Arabian/the) = 2/5 = 0.4    P(Knights/Arabian) = 2/2 = 1.0
P(are/these) = 1.0    P(the/are) = 0.5    P(fairy/the) = 0.2
P(tales/fairy) = 1.0    P(of/tales) = 1.0    P(the/of) = 1.0
P(east/the) = 0.2    P(stories/the) = 0.2    P(of/stories) = 1.0    P(are/Knights) = 1.0
P(translated/are) = 0.5    P(in/translated) = 1.0    P(many/in) = 1.0
P(language/many) = 1.0
Test sentence (S): The Arabian knights are the fairy tales of the east
P(S) = P(the/<s>) · P(Arabian/the) · P(Knights/Arabian) · P(are/Knights) · P(the/are) · P(fairy/the) · P(tales/fairy) · P(of/tales) · P(the/of) · P(east/the)
     = 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2 ≈ 0.0054
A trigram model generates more natural sentences.
The main pros and cons of BoN:
• It captures some context and word-order information in the form of n-grams.
• Thus, the resulting vector space is able to capture some semantic similarity: documents having the same n-grams will have their vectors closer to each other in Euclidean space as compared to documents with completely different n-grams.
• As n increases, dimensionality (and therefore sparsity) increases rapidly.
• It still provides no way to address the OOV problem.


The n-gram model suffers from the data sparseness problem:
 An n-gram that does not occur in the training data is assigned zero probability, so even with a large corpus there are many zero entries in the bigram matrix.
 A number of smoothing techniques have been developed to handle the data sparseness problem.
 The word "smoothing" is used to denote these techniques because they tend to make distributions more uniform by moving the extreme probabilities towards the average.
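As one concrete example of such a technique (add-one, or Laplace, smoothing; the toy counts and vocabulary size below are assumptions of this sketch), the maximum-likelihood estimate count(prev, w) / count(prev) is replaced by a smoothed estimate so that unseen bigrams no longer get zero probability:

from collections import Counter

bigram_counts = Counter({("the", "arabian"): 2, ("the", "fairy"): 1,
                         ("the", "east"): 1, ("the", "stories"): 1})
unigram_counts = Counter({"the": 5})
V = 14   # assumed vocabulary size of the toy corpus

def p_laplace(word, prev):
    # Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(round(p_laplace("arabian", "the"), 3))   # 0.158 for a seen bigram
print(round(p_laplace("knights", "the"), 3))   # 0.053 for an unseen bigram (no longer zero)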
Distributed Representations
 We saw some key drawbacks that are common to all basic
vectorization approaches.
 To overcome these limitations, methods to learn low dimensional
representations were devised.
 They use neural network architectures to create dense, low-
dimensional representations of words and texts.

We need to understand some key terms:

 Distributional similarity
This is the idea that the meaning of a word can be understood from
the context in which the word appears. This is also known as
connotation: meaning is defined by context. This is opposed to
denotation: the literal meaning of any word.
 Distributional hypothesis
In linguistics, this hypothesizes that words that occur in similar contexts have similar meanings. For example, "dog" and "cat" tend to occur in similar contexts, so according to the distributional hypothesis, there must be a strong similarity between the meanings of these two words.
If two words often occur in similar contexts, then their corresponding representation vectors must also be close to each other.

 Distributional representation
This refers to representation schemes that are obtained based on the distribution of words in the contexts in which they appear.
These schemes are based on the distributional hypothesis.

 Distributed representation
It is based on the distributional hypothesis.
Distributed representation schemes significantly compress the
dimensionality. This results in vectors that are compact (i.e., low
dimensional) and dense (i.e., hardly any zeros).
 Embedding
For the set of words in a corpus, embedding is a mapping between
vector space coming from distributional representation to vector space
coming from distributed representation.

 Vector semantics
This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus.
Word Embeddings

In 2013, a seminal work by Mikolov et al. showed that their neural network–based word representation model, known as "Word2vec," based on "distributional similarity," can capture word analogy relationships such as:

King – Man + Woman ≈ Queen


Training our own embeddings

We'll look at the two architectural variants that were proposed in the original Word2vec approach:

 Continuous bag of words (CBOW)

 Skip Gram

Both of these are similar in many respects. We'll use the
sentence “The quick brown fox jumps over the lazy dog” as our toy
corpus.
CBOW
• The primary task is to build a language model that correctly predicts
the center word given the context words in which the center word
appears.
• The objective of a language model is to assign probabilities in such a
way that it gives high probability to “good” sentences and low
probabilities to “bad” sentences.
• By good, we mean sentences that are semantically and syntactically
correct.
• By bad, we mean sentences that are incorrect—semantically or
syntactically or both.
•For a sentence like “The cat jumped over the dog,” it will try to assign
a probability close to 1.0, whereas for a sentence like “jumped over
the the cat dog,” it tries to assign a probability close to 0.0.
• CBOW tries to learn a language model that tries to predict the
“center” word from the words in its context.
Skip-Gram
 In Skip-Gram, the task is to predict the context words from the center word.
 For our toy corpus with context size 2, using the center word "jumps," we try to predict every word in its context: "brown," "fox," "over," "the".
 This constitutes one step. Skip-Gram repeats this step for every word in the corpus as the center word.
 To use both the CBOW and Skip-Gram algorithms in practice, there are several available implementations that abstract the mathematical details for us. One of the most commonly used implementations is gensim.
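A minimal gensim sketch for the toy corpus (the hyperparameter values are arbitrary illustrations, and gensim's 4.x API is assumed):

from gensim.models import Word2Vec

corpus = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
# sg=0 trains CBOW, sg=1 trains Skip-Gram; window=2 matches the context size used above
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["jumps"][:5])            # first few dimensions of the learned vector for "jumps"
print(model.wv.most_similar("fox"))     # nearest neighbours in the learned vector space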
