
SUBJECT CODE: 22AI632
BY: RACHEL E C, BITM, BALLARI
MODULE – 2
NLP PIPELINE
Data Acquisition
 Data is the heart of any ML system.
 For this, we need labeled data: a collection of texts (for example, user queries) where each one is labeled with the category it belongs to.
 How can we get such data?
 Use a public dataset
 Scrape data
 Product intervention
 Data augmentation (a short sketch of one such technique follows this list)
 Synonym replacement
 Back translation
 TF-IDF–based word replacement
 Bigram flipping
 Replacing entities
 Adding noise to data
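As a small illustration of one augmentation technique from the list above, here is a minimal synonym-replacement sketch using NLTK's WordNet (the helper function, sample sentence, and number of replacements are assumptions of this example, not part of the original material):

import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_replace(sentence, n_replacements=1):
    # Replace up to n words in the sentence with a randomly chosen WordNet synonym
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for idx in candidates[:n_replacements]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(words[idx]) for lemma in syn.lemmas()}
        synonyms.discard(words[idx])
        if synonyms:
            words[idx] = random.choice(sorted(synonyms))
    return " ".join(words)

print(synonym_replace("the delivery was quick and the food was good"))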
Text Extraction and Cleanup
 Text extraction and cleanup refers to the process of extracting raw
text from the input data by removing all the other non-textual
information, such as markup, metadata, etc., and converting the text
to the required encoding format.
 Text extraction is a standard data-wrangling step, and we don’t
usually employ any NLP-specific techniques during this process.
 Clean text is human language rearranged into a format that
machine models can understand.

Step 1: Lowercase / Uppercase

Converting all text to a single case (usually lowercase) helps maintain consistency during NLP tasks and text mining. Python's built-in lower() string method makes this step straightforward.
Step 2: Punctuation Removal
In this step we remove all punctuation, because punctuation adds noise to the sentence and introduces ambiguity while training the model.
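A minimal sketch of Steps 1 and 2 combined (the sample sentence is only an assumed example):

import string

text = "Hello, World!! NLP is Fun..."
text = text.lower()                                               # Step 1: lowercase
text = text.translate(str.maketrans("", "", string.punctuation))  # Step 2: remove punctuation
print(text)   # hello world nlp is fun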

Step 3: HTML Code and URL Links

To strip URLs and HTML tags, we can simply use the following code:

import re
text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)  # remove URLs
text = re.sub(r'<.*?>+', '', text)                                     # remove HTML tags

Step 4: Spell Checks


Incoming text data often has spelling errors. This can be prevalent in
search engines, text-based chatbots deployed on mobile devices,
social media, and many other sources.
Shorthand typing: Hllo world! I am back!
Fat finger problem [20]: I pronise that I will not bresk the silence
again!
System-Specific Error Correction
 Text extraction from scanned documents is typically done through optical character recognition (OCR).
 One approach is to run the OCR output through a spell checker such as pyenchant (see the sketch below), which will identify misspellings and suggest some alternatives.
 Another approach is to use neural network architectures to train word/character-based language models, which are in turn used to correct the OCR output based on the surrounding context.
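A minimal sketch of the pyenchant approach mentioned above (the dictionary tag "en_US" and the sample word are assumptions of the example):

import enchant

d = enchant.Dict("en_US")      # load an English dictionary
word = "pronise"
if not d.check(word):          # flag the misspelling
    print(d.suggest(word))     # suggested alternatives, e.g. "promise", "premise", ...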
Pre-Processing

Step 5: Tokenization
Tokenization splits a whole sentence into its constituent words (tokens). In the simplest case, a separator such as whitespace can be used for this purpose.
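A minimal sketch, first with a simple whitespace separator and then with NLTK's word_tokenize (assuming the punkt tokenizer data has been downloaded):

import nltk
nltk.download("punkt", quiet=True)

text = "Dog bites man."
print(text.split())               # ['Dog', 'bites', 'man.']
print(nltk.word_tokenize(text))   # ['Dog', 'bites', 'man', '.']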

Step 6: Removing Stop Words

Stop words are the most commonly used words in a language; for instance, "a," "our," "for," and "in." Removing these words helps the model to consider only the key features.
These words also don't carry much information, so by eliminating them data scientists can focus on the important words.
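A minimal stop-word-removal sketch using NLTK's English stop word list (an assumed choice; spaCy or a custom list would work equally well):

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))
tokens = ["a", "dog", "bites", "a", "man", "in", "our", "street"]
print([t for t in tokens if t not in stop_words])   # ['dog', 'bites', 'man', 'street']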

Step 7: Normalization
Normalization is a further cleaning step used to maintain uniformity. It brings the different surface forms of a word under one roof through stemming and lemmatization.
Stemming
Stemming reduces a word to its stem by chopping off affixes. Many variations of a word do not bring any new information and create redundancy, ultimately introducing ambiguity when training machine learning models; for example, "He likes to walk" and "He likes walking" convey the same meaning.
Lemmatization
Lemmatization performs normalization using a vocabulary and morphological analysis of words.
Lemmatization aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
In NLTK, the lemmatizer is built on WordNet's morphy function, making it a linguistically informed operation for text analysis.
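A minimal sketch contrasting stemming and lemmatization with NLTK (the example words are assumptions of the sketch):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem("walking"))                  # 'walk'
print(stemmer.stem("studies"))                  # 'studi'  (stems need not be real words)
print(lemmatizer.lemmatize("studies"))          # 'study'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'   (uses WordNet and the POS hint)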
NLP PIPELINE
Text Representation
 Feature extraction is an important step for any machine learning
problem.
 How do we transform a given text into numerical form so that it can
be fed into NLP and ML algorithms?
 In NLP parlance, this conversion of raw text to a suitable numerical
form is called text representation.
 Feature representation is a common step in any ML project, whether
the data is text, images, videos, or speech.
 However, feature representation for text is often much more involved
as compared to other formats of data.
 The way an image is stored in a computer is in the form of a matrix of
pixels where each cell[i,j] in the matrix represents pixel i,j of the image.
We will look at approaches ranging from simple methods to state-of-the-art techniques for representing text.
These approaches are classified into four categories:
 Basic vectorization approaches
 Distributed representations
 Universal language representation
 Handcrafted features

 Suppose we are given a labeled text corpus and asked to build a sentiment analysis model.
 To correctly predict the sentiment of a sentence, the model needs to understand the meaning of the sentence.
In order to correctly extract the meaning of the sentence, the most
crucial data points are:
1. Break the sentence into lexical units such as lexemes, words, and
phrases
2. Derive the meaning for each of the lexical units
3. Understand the syntactic (grammatical) structure of the sentence.
4. Understand the context in which the sentence appears

 The semantics (meaning) of the sentence arises from the combination of the above points.
 Thus, any good text representation scheme must facilitate the extraction of those data points in the best possible way to reflect the linguistic properties of the text.
Vector Space Models
 Text units (characters, phonemes, words, phrases, sentences, paragraphs, and documents) are represented with vectors of numbers.
 This is known as the vector space model (VSM)
 It’s a simple algebraic model used extensively for representing any
text blob.
 VSM is fundamental to many information-retrieval operations,
from scoring documents on a query to document classification and
document clustering.
 It’s a mathematical model that represents text units as vectors.
 In the simplest form, these are vectors of identifiers, such as index
numbers in a corpus vocabulary.
 The most common way to calculate similarity between two text
blobs is using cosine similarity: the cosine of the angle between their
corresponding vectors.
 The cosine of 0° is 1 and the cosine of 180° is –1, with the cosine
monotonically decreasing from 0° to 180°.
Given two vectors, A and B, each with n components, the similarity between them is computed as follows:

similarity(A, B) = cos(θ) = (A · B) / (|A| |B|) = (Σi Ai·Bi) / ( √(Σi Ai²) · √(Σi Bi²) )

where Ai and Bi are the i-th components of vectors A and B, respectively. Sometimes, people also use the Euclidean distance between vectors to capture similarity.
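A minimal sketch of the cosine similarity computation with NumPy (the two example vectors are assumed purely for illustration):

import numpy as np

A = np.array([1, 1, 1, 0, 0, 0])
B = np.array([1, 0, 0, 1, 0, 1])
cos_sim = float(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)))
print(round(cos_sim, 2))   # 0.33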
Basic Vectorization Approaches
Let’s start with a basic idea of text representation: map each word in
the vocabulary (V) of the text corpus to a unique ID (integer value),
then represent each sentence or document in the corpus as a V-
dimensional vector.
D1 – Dog bites man
D2 – Man bites dog
D3 – Dog eats meat
D4 – Man eats food
We first map each of the six words to unique IDs: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6. Let's consider the document D1: "dog bites man". As per the scheme, each word is a six-dimensional vector. Dog is represented as [1 0 0 0 0 0], as the word "dog" is mapped to ID 1. Bites is represented as [0 1 0 0 0 0], and so on and so forth. Thus, D1 is represented as [[1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]]. D4 is represented as [[0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]]. This scheme is known as one-hot encoding.
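A minimal sketch of this one-hot scheme for the toy corpus (the helper function is an assumption of the example; note that the IDs below follow alphabetical order rather than the mapping used on the slide):

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
vocab = {w: i for i, w in enumerate(sorted({w for doc in corpus for w in doc.split()}))}

def one_hot(doc):
    # One |V|-dimensional vector per word in the document
    vectors = []
    for word in doc.split():
        vec = [0] * len(vocab)
        vec[vocab[word]] = 1
        vectors.append(vec)
    return vectors

print(one_hot("dog bites man"))
# [[0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0]]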
Shortcomings
1. Sparsity and overfitting – the size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies, which makes the representation very sparse and high dimensional.
2. This representation does not give a fixed-length representation for text, whereas most learning algorithms need feature vectors of the same length.
3. It treats words as atomic units and has no notion of (dis)similarity between words.
4. Out-of-vocabulary (OOV) problem – for a new sentence such as "man eats fruits", the word "fruits" cannot be represented because it does not appear in the training data.
Bag of Words
 Bag of words (BoW) is a classical text representation technique that
has been used commonly in NLP, especially in text classification
problems.
 The key idea behind it is as follows: represent the text under
consideration as a bag (collection) of words while ignoring the order
and context.
 The basic intuition behind it is that it assumes that the text
belonging to a given class in the dataset is characterized by a unique
set of words.
 If two text pieces have nearly the same words, then they belong to
the same bag (class).
 Thus, by analyzing the words present in a piece of text, one can
identify the class (bag) it belongs to.
D1 – Dog bites man
D2 – Man bites dog
D3 – Dog eats meat
D4 – Man eats food
where the word IDs are dog = 1, bites = 2, man = 3, meat = 4 , food =
5, eats = 6, D1 becomes [1 1 1 0 0 0]. This is because the first three
words in the vocabulary appeared exactly once in D1, and the last
three did not appear at all. D4 becomes [0 0 1 0 1 1].
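A minimal BoW sketch using scikit-learn's CountVectorizer (the choice of library is an assumption; any equivalent implementation works):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bow = CountVectorizer()
X = bow.fit_transform(corpus)            # document-term count matrix
print(bow.get_feature_names_out())       # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray()[0])                    # D1 -> [1 1 0 0 1 0]  (columns in alphabetical order)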

Advantages
 BoW is fairly simple to understand and implement.
 The vector space resulting from the BoW scheme captures the
semantic similarity of documents.
 We have a fixed-length encoding for any sentence of arbitrary
length.
Disadvantages
 The size of the vector increases with the size of the vocabulary.
 It does not capture the similarity between different words that
mean the same thing.
 This representation does not have any way to handle out-of-vocabulary words (i.e., new words that were not seen in the corpus that was used to build the vectorizer).
 As the name indicates, it is a "bag" of words: word order information is lost in this representation.
TF-IDF
 Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used statistical method in natural language processing and information retrieval.
 It measures how important a term is within a document relative to a collection of documents (i.e., relative to a corpus): term frequency (TF) counts how often the term occurs in the document, while inverse document frequency (IDF) down-weights terms that appear in many documents of the corpus.
 Words within a text document are transformed into importance numbers (TF × IDF weights) by this text vectorization process.
S1 – Dog bites man
S2 – Man bites dog
S3 – Dog eats meat
S4 – Man eats food
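A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer on the four sentences above (the library choice is an assumption of the example):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())     # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(X.toarray().round(2)[0])           # TF-IDF weights for S1; rarer terms get larger weights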
Different levels of analysis required:

Morphological analysis – the deep linguistic analysis process that determines the lexical and grammatical features of each token in addition to its POS. The result of this analysis is a list of universal features.

Syntactic analysis – also called syntax analysis or parsing; the process of analyzing natural language text with the rules of a formal grammar.

Semantic analysis – a crucial part of NLP that concentrates on understanding the meaning, interpretation, and relationships between words, phrases, and sentences in a given context.

Discourse analysis – extracting meaning from a corpus or text beyond the sentence level; it involves interpreting the text along with its social context, and it helps in training NLP models better.
Bag of N-Grams

N-grams are contiguous sequences of words, symbols, or tokens in a document, i.e., the neighboring sequences of items in the document.
 They are used most importantly in tasks dealing with text data in
NLP (Natural Language Processing).
 Given a sequence of N-1 words, an N-gram model predicts the
most probable word that might follow this sequence.
 A model that simply relies on how often a word occurs without
looking at previous words is called unigram.
 If a model considers only the previous word to predict the
current word, then it's called bigram.
 If two previous words are considered, then it's a trigram model.
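A bag-of-n-grams representation can be sketched with scikit-learn's CountVectorizer by setting ngram_range (an assumed choice of library, shown here for unigrams plus bigrams):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man", "Man bites dog", "Dog eats meat", "Man eats food"]
bon = CountVectorizer(ngram_range=(1, 2))    # unigrams and bigrams
X = bon.fit_transform(corpus)
print(bon.get_feature_names_out())           # e.g. 'bites', 'bites dog', 'bites man', 'dog', 'dog bites', ...
print(X.toarray()[0])                        # counts of each unigram/bigram in D1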
A statistical language model is a probabilistic model that assigns a probability to a sequence of words.
It is able to predict the next word in a sequence given a history context represented by the preceding words.
The probability that we want to model can be factorized using the chain rule as follows:

P(w1, w2, …, wn) = P(w1 | w0) · P(w2 | w0, w1) · … · P(wn | w0, w1, …, w(n−1))

where w0 is a special token (written <s> below) that denotes the start of the sentence.
In practice, we usually use what are called N-gram models, which make a Markov assumption to limit the history context: each word is conditioned only on the previous N−1 words. An example:
Training Set:
The Arabian Knights
These are the fairy tales of the east
The stories of the Arabian knights are translated in
many language.
Bigram model:
P(the/<s>) = 2/3 ≈ 0.67    P(Arabian/the) = 2/5 = 0.4    P(Knights/Arabian) = 2/2 = 1.0
P(are/these) = 1.0    P(the/are) = 0.5    P(fairy/the) = 0.2
P(tales/fairy) = 1.0    P(of/tales) = 1.0    P(the/of) = 1.0
P(east/the) = 0.2    P(stories/the) = 0.2    P(of/stories) = 1.0    P(are/Knights) = 1.0
P(translated/are) = 0.5    P(in/translated) = 1.0    P(many/in) = 1.0
P(language/many) = 1.0
Test sentence (S): The Arabian knights are the fairy tales of the east
P(S) = P(the/<s>) · P(Arabian/the) · P(Knights/Arabian) · P(are/Knights) · P(the/are) · P(fairy/the) · P(tales/fairy) · P(of/tales) · P(the/of) · P(east/the)
     = 0.67 × 0.4 × 1.0 × 1.0 × 0.5 × 0.2 × 1.0 × 1.0 × 1.0 × 0.2 ≈ 0.0054
A trigram model generates more natural sentences.
The main pros and cons of BoN:
• It captures some context and word-order information in the form of n-grams.
• Thus, the resulting vector space is able to capture some semantic similarity: documents having the same n-grams will have their vectors closer to each other in Euclidean space as compared to documents with completely different n-grams.
• As n increases, dimensionality (and therefore sparsity) increases rapidly.
• It still provides no way to address the OOV problem.


The n-gram model suffers from the data sparseness problem:
 An n-gram that does not occur in the training data is assigned zero probability, so even with a large corpus there are many zero entries in the bigram matrix.
 A number of smoothing techniques have been developed to handle the data sparseness problem.
 The word "smoothing" is used to denote these techniques because they tend to make distributions more uniform by moving the extreme probabilities towards the average.
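As one concrete example of such a technique (add-one, or Laplace, smoothing; the toy counts and vocabulary size below are assumptions of this sketch), the maximum-likelihood estimate count(prev, w) / count(prev) is replaced by a smoothed estimate so that unseen bigrams no longer get zero probability:

from collections import Counter

bigram_counts = Counter({("the", "arabian"): 2, ("the", "fairy"): 1,
                         ("the", "east"): 1, ("the", "stories"): 1})
unigram_counts = Counter({"the": 5})
V = 14   # assumed vocabulary size of the toy corpus

def p_laplace(word, prev):
    # Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(round(p_laplace("arabian", "the"), 3))   # 0.158 for a seen bigram
print(round(p_laplace("knights", "the"), 3))   # 0.053 for an unseen bigram (no longer zero)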
Distributed Representations
 We saw some key drawbacks that are common to all basic
vectorization approaches.
 To overcome these limitations, methods to learn low dimensional
representations were devised.
 They use neural network architectures to create dense, low-
dimensional representations of words and texts.

We need to understand some key terms:

 Distributional similarity
This is the idea that the meaning of a word can be understood from
the context in which the word appears. This is also known as
connotation: meaning is defined by context. This is opposed to
denotation: the literal meaning of any word.
 Distributional hypothesis
In linguistics, this hypothesizes that words that occur in similar contexts have similar meanings. For example, "dog" and "cat" tend to occur in similar contexts, so according to the distributional hypothesis, there must be a strong similarity between the meanings of these two words.
If two words often occur in similar contexts, then their corresponding representation vectors must also be close to each other.

 Distributional representation
This refers to representation schemes that are obtained based on the distribution of words in the contexts in which they appear.
These schemes are based on the distributional hypothesis.

 Distributed representation
It is based on the distributional hypothesis.
Distributed representation schemes significantly compress the
dimensionality. This results in vectors that are compact (i.e., low
dimensional) and dense (i.e., hardly any zeros).
 Embedding
For the set of words in a corpus, embedding is a mapping between
vector space coming from distributional representation to vector space
coming from distributed representation.

 Vector semantics
This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus.
Word Embeddings

In 2013, a seminal work by Mikolov et al. showed that their neural network–based word representation model, known as "Word2vec," based on "distributional similarity," can capture word analogy relationships such as:

King – Man + Woman ≈ Queen


Training our own embeddings

We'll look at the two architectural variants that were proposed in the original Word2vec approach:

 Continuous bag of words (CBOW)

 Skip Gram

Both of these are similar in many respects. We'll use the
sentence “The quick brown fox jumps over the lazy dog” as our toy
corpus.
CBOW
• The primary task is to build a language model that correctly predicts
the center word given the context words in which the center word
appears.
• The objective of a language model is to assign probabilities in such a
way that it gives high probability to “good” sentences and low
probabilities to “bad” sentences.
• By good, we mean sentences that are semantically and syntactically
correct.
• By bad, we mean sentences that are incorrect—semantically or
syntactically or both.
•For a sentence like “The cat jumped over the dog,” it will try to assign
a probability close to 1.0, whereas for a sentence like “jumped over
the the cat dog,” it tries to assign a probability close to 0.0.
• CBOW tries to learn a language model that tries to predict the
“center” word from the words in its context.
Skip-Gram
 In Skip-Gram, the task is to predict the context words from the center word.
 For our toy corpus with context size 2, using the center word "jumps," we try to predict every word in its context: "brown," "fox," "over," "the".
 This constitutes one step. Skip-Gram repeats this step for every word in the corpus as the center word.
 To use both the CBOW and Skip-Gram algorithms in practice, there are several available implementations that abstract the mathematical details for us. One of the most commonly used implementations is gensim.
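A minimal gensim sketch for the toy corpus (the hyperparameter values are arbitrary illustrations, and gensim's 4.x API is assumed):

from gensim.models import Word2Vec

corpus = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
# sg=0 trains CBOW, sg=1 trains Skip-Gram; window=2 matches the context size used above
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["jumps"][:5])            # first few dimensions of the learned vector for "jumps"
print(model.wv.most_similar("fox"))     # nearest neighbours in the learned vector space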
