NLP_Assignment2

Natural Language Processing

A company has a dataset containing raw customer reviews. Design an NLP
pipeline to preprocess this data for sentiment analysis. Discuss each step in
detail.

Example Data
"The product was fantastic! But delivery was delayed."
"Horrible customer service. Would not recommend!"
"Great quality for the price."

NLP Pipeline Design for Sentiment Analysis:

Preprocessing raw customer reviews for sentiment analysis involves a series of steps in a
natural language processing pipeline that converts unstructured text into structured data suitable
for model training. Each step cleans, standardizes, and then transforms the data into features
that capture the sentiment-relevant aspects of the text. A detailed explanation of the pipeline
follows, with a short code sketch after the list.

1. Text Cleaning: Remove irrelevant characters such as special symbols (!@#), numbers,
and extra spaces to ensure uniformity. Convert text to lowercase to standardize
comparisons (e.g., “Fantastic” becomes “fantastic”). For example, "The product was fantastic!
But delivery was delayed." becomes "the product was fantastic but delivery was delayed".
2. Tokenization: Split the text into individual words or tokens using libraries like NLTK or
spaCy. This breaks sentences into analyzable units. The cleaned review splits into tokens:
["the", "product", "was", "fantastic", "but", "delivery", "was", "delayed"].

3. Stopword Removal: Remove common words (e.g., "the," "was," "but") that don’t carry
sentiment information. This helps reduce noise, leaving tokens like ["product", "fantastic",
"delivery", "delayed"].
4. Stemming/Lemmatization: Convert words to their root forms to ensure consistency. For
instance, stemming reduces "delayed" to "delay", while lemmatization returns a valid
dictionary form that depends on the word's part of speech. This step is useful for aligning
similar words across reviews.
5. Part-of-Speech (POS) Tagging: Identify word types (e.g., adjectives, nouns) to focus on
sentiment-bearing words such as "fantastic" or "horrible."
6. Sentiment Lexicon Mapping: Annotate words using a pre-built sentiment lexicon (e.g.,
VADER, SentiWordNet) to attach polarity scores. This helps assign preliminary
sentiment values to individual words in reviews.
7. Text Vectorization: Convert text into numerical features using methods like Bag of
Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or Word
Embeddings (e.g., Word2Vec, GloVe, BERT). For example, BoW represents "great quality
for the price" as a sparse vector indicating word occurrences, while Word2Vec captures
word relationships in dense vectors.
8. Handling Negations: Incorporate rules for negation handling (e.g., “not recommend”
changes the polarity of “recommend”). Techniques such as adding a “not” prefix to
subsequent words help retain context.
9. Feature Scaling: Normalize vectorized features to improve machine learning model
performance. For instance, scale TF-IDF scores or word embeddings to a uniform range.
10. Label Encoding: Assign numeric sentiment labels to reviews based on polarity (e.g.,
Positive = 1, Negative = 0). The example "Horrible customer service. Would not recommend!" is
labeled as Negative (0), while "Great quality for the price." is Positive (1).
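
The sketch below illustrates steps 1-4 and 7 on the example reviews. It assumes NLTK and scikit-learn are installed; the function and variable names are illustrative choices, not part of any prescribed solution.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

reviews = [
    "The product was fantastic! But delivery was delayed.",
    "Horrible customer service. Would not recommend!",
    "Great quality for the price.",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Step 1: text cleaning - lowercase and drop non-letter characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Step 2: tokenization
    tokens = nltk.word_tokenize(text)
    # Steps 3 and 4: stopword removal and lemmatization
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

cleaned = [" ".join(preprocess(r)) for r in reviews]
print(cleaned)  # e.g., first review -> "product fantastic delivery delayed"

# Step 7: text vectorization with TF-IDF
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(cleaned)
print(features.shape)  # (number of reviews, vocabulary size)
```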
Write a Python function to tokenize and remove stop words from the sentence "The
quick brown fox jumps over the lazy dog." Explain how this step affects the
quality of an NLP model.

Preprocessing a sentence by tokenization and stop word removal improves the quality of an NLP
model by focusing it on meaningful words and reducing noise. Stop words such as "the" and "is"
add little value toward understanding the context or intent of the text. Removing them lets the
model process only relevant data, which reduces dimensionality and improves computational
efficiency. It also increases the signal-to-noise ratio, allowing the model to learn patterns more
effectively. For instance, in the sentence "The quick brown fox jumps over the lazy dog," the core
content is retained in words such as "quick," "brown," and "jumps." These steps enhance
performance in tasks such as sentiment analysis, text classification, and topic modeling, all of
which benefit from clear and concise input.
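
A minimal version of the requested function, assuming NLTK is available (the resource downloads are one-time):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

def tokenize_and_remove_stopwords(text):
    """Tokenize text and drop English stop words and punctuation."""
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [tok for tok in tokens if tok.isalpha() and tok not in stop_words]

print(tokenize_and_remove_stopwords("The quick brown fox jumps over the lazy dog."))
# Expected output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Note that "over" is also dropped here because it appears in NLTK's English stop word list; whether such words carry useful signal depends on the task.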
Compare and contrast traditional feature extraction techniques like TF-IDF
with modern embeddings like Word2Vec and BERT. Discuss the impact of
these advancements.

Comparison of Traditional and Modern Feature Extraction Techniques


Aspect            | TF-IDF (Traditional)                                        | Word2Vec / BERT (Modern)
Representation    | Sparse matrix based on word frequency.                      | Dense vector embeddings capturing word semantics.
Context Awareness | No understanding of word context.                           | Word2Vec considers local context; BERT considers full sentence context (bidirectional).
Dimensionality    | High-dimensional (one dimension per word in vocabulary).    | Low-dimensional (fixed-size vectors, e.g., 300 for Word2Vec).
Handling Synonyms | Treats synonyms as separate entities.                       | Captures semantic similarity, treating synonyms as similar.
Sentence Context  | Fails to capture word order or sentence meaning.            | Embeddings like BERT understand sentence meaning and structure.
Pre-training      | Requires manual feature engineering for every new dataset.  | Pre-trained on massive corpora, requiring only fine-tuning for specific tasks.
Performance       | Suitable for small datasets and simpler tasks.              | Excels in complex tasks requiring deeper understanding of language.

Impact of Advancements

Modern embeddings such as Word2Vec and BERT substantially raise the bar for NLP tasks
by providing far richer, contextually aware text representations. They can capture finer shades
of meaning, such as polysemy and context-dependent relations, which makes them valuable for
sentiment analysis, machine translation, and text classification. A major advantage of
pre-trained models is the time and computational resources they save: instead of being trained
from scratch, a model is simply fine-tuned for the task at hand.
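
To make the contrast concrete, a brief sketch of how each representation is typically obtained is shown below, assuming scikit-learn and the Hugging Face transformers library; "bert-base-uncased" is the standard public checkpoint, and the mean-pooling step is just one common way to obtain a sentence vector.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModel

sentences = ["Great quality for the price.", "Horrible customer service."]

# Traditional: sparse TF-IDF vectors, one dimension per vocabulary word
tfidf = TfidfVectorizer().fit_transform(sentences)
print("TF-IDF shape:", tfidf.shape)  # (2, vocabulary size), sparse

# Modern: dense contextual embeddings from a pre-trained BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (2, sequence length, 768)
sentence_vectors = hidden.mean(dim=1)          # simple mean pooling
print("BERT shape:", tuple(sentence_vectors.shape))  # (2, 768), dense
```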

You are tasked with classifying emails as spam or non-spam. Justify the choice
of feature extraction technique you would use and why.

The choice of feature extraction technique depends on the complexity of the task and the nature
of the dataset. TF-IDF is a strong candidate for a standard spam detection task, since it highlights
key distinguishing terms such as "free," "click here," or "win now," which characterize spam
emails. It is computationally efficient and simple, making it a good fit for smaller datasets; its
implementation requires no pre-training, so it is easy to apply and interpret.

However, when the email dataset involves nuance or context, such as subtle phishing emails,
BERT is a better choice. BERT captures the intent and semantic meaning of the text, going far
beyond raw word frequencies. Fine-tuning a pre-trained BERT model on email data can yield
substantial gains, especially in complex cases where the indications of spam depend on tone or
structure.

Therefore, while TF-IDF may work well enough for simple scenarios with obvious spam
keywords, BERT will perform much better on more complex, subtle spam classification problems.
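
A hedged sketch of the TF-IDF route is given below; the emails and labels are hypothetical toy data standing in for a real labelled corpus, and logistic regression is just one reasonable choice of classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in practice these come from a labelled email corpus
emails = [
    "Win a free prize, click here now!",
    "Meeting rescheduled to 3 pm tomorrow.",
    "Claim your reward, limited time offer!",
    "Please review the attached quarterly report.",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = non-spam

# TF-IDF features feeding a simple linear classifier
spam_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
spam_clf.fit(emails, labels)

print(spam_clf.predict(["Free reward, click here!"]))  # likely [1] (spam)
```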

Write a Python script to train a Recurrent Neural Network (RNN) on the Shakespeare text dataset
available at this link. Follow the NLP pipeline to train the model and generate Shakespearean-style
text.

Recurrent Neural Networks (RNNs) are designed to handle sequential input, which makes them
well suited to text generation tasks. When applied to the Shakespeare text dataset, an RNN learns
patterns and structures by processing the text sequentially and predicting the next character or
word. Following the NLP pipeline, the dataset is preprocessed, tokenized, and then used to train
the model. Once trained, the RNN can generate Shakespearean text by predicting and appending
tokens sequentially, capturing the distinct style and flow of the original works. The link to the
model is provided below:

RNN MODEL LINK
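
Since the paragraph above only summarizes the approach, a condensed character-level RNN sketch in TensorFlow/Keras follows. The dataset URL is the publicly hosted Shakespeare text used in TensorFlow tutorials and is an assumption here, since the assignment's original link is not reproduced; the hyperparameters are illustrative.

```python
import numpy as np
import tensorflow as tf

# Assumed public mirror of the Shakespeare text (stands in for the assignment's link)
path = tf.keras.utils.get_file(
    "shakespeare.txt",
    "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt",
)
text = open(path, "rb").read().decode("utf-8")

# Character-level vocabulary and integer encoding
vocab = sorted(set(text))
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = np.array(vocab)
encoded = np.array([char2idx[c] for c in text])

# Build (input, target) pairs: predict the next character at every position
seq_len = 100
ds = tf.data.Dataset.from_tensor_slices(encoded)
ds = ds.batch(seq_len + 1, drop_remainder=True)
ds = ds.map(lambda s: (s[:-1], s[1:]))
ds = ds.shuffle(10000).batch(64, drop_remainder=True)

# Simple recurrent model: embedding -> GRU -> per-character logits
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(vocab), 256),
    tf.keras.layers.GRU(512, return_sequences=True),
    tf.keras.layers.Dense(len(vocab)),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(ds, epochs=5)  # more epochs give more convincing output

# Generate text character by character from a seed string
def generate(seed, length=300, temperature=1.0):
    ids = [char2idx[c] for c in seed]
    for _ in range(length):
        logits = model(np.array([ids]))[0, -1] / temperature
        next_id = tf.random.categorical(logits[None, :], num_samples=1)[0, 0].numpy()
        ids.append(int(next_id))
    return "".join(idx2char[i] for i in ids)

print(generate("ROMEO: "))
```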
