NLP_Assignment2

Natural Language Processing

A company has a dataset containing raw customer reviews. Design an NLP
pipeline to preprocess this data for sentiment analysis. Discuss each step in
detail.

Example Data
"The product was fantastic! But delivery was delayed."
"Horrible customer service. Would not recommend!"
"Great quality for the price."

NLP Pipeline Design for Sentiment Analysis:

Preprocessing raw customer reviews for sentiment analysis involves a series of steps in a
natural language processing pipeline that converts unstructured text into structured data suitable
for model training. Each step cleans, standardizes, and then transforms the data into features
that capture the sentiment-relevant aspects of the text. A detailed explanation of the pipeline
follows, with a short code sketch after the list.

1. Text Cleaning: Remove irrelevant characters such as special symbols (!@#), numbers,
and extra spaces to ensure uniformity. Convert text to lowercase to standardize
comparisons (e.g., “Fantastic” becomes “fantastic”). For example, "The product was fantastic!
But delivery was delayed." becomes "the product was fantastic but delivery was delayed".
2. Tokenization: Split the text into individual words or tokens using libraries like NLTK or
spaCy. This breaks sentences into analyzable units. The cleaned review splits into tokens:
["the", "product", "was", "fantastic", "but", "delivery", "was", "delayed"].

3. Stopword Removal: Remove common words (e.g., "the," "was," "but") that don’t carry
sentiment information. This helps reduce noise, leaving tokens like ["product", "fantastic",
"delivery", "delayed"].
4. Stemming/Lemmatization: Convert words to their root forms to ensure consistency. For
instance, stemming reduces "delayed" to "delay", while lemmatization returns a valid
dictionary form that depends on the word's part of speech. This step is useful for aligning
similar words across reviews.
5. Part-of-Speech (POS) Tagging: Identify word types (e.g., adjectives, nouns) to focus on
sentiment-bearing words such as "fantastic" or "horrible."
6. Sentiment Lexicon Mapping: Annotate words using a pre-built sentiment lexicon (e.g.,
VADER, SentiWordNet) to attach polarity scores. This helps assign preliminary
sentiment values to individual words in reviews.
7. Text Vectorization: Convert text into numerical features using methods like Bag of
Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or Word
Embeddings (e.g., Word2Vec, GloVe, BERT). For example, BoW represents "great quality
for the price" as a sparse vector indicating word occurrences, while Word2Vec captures
word relationships in dense vectors.
8. Handling Negations: Incorporate rules for negation handling (e.g., “not recommend”
changes the polarity of “recommend”). Techniques such as adding a “not” prefix to
subsequent words help retain context.
9. Feature Scaling: Normalize vectorized features to improve machine learning model
performance. For instance, scale TF-IDF scores or word embeddings to a uniform range.
10. Label Encoding: Assign numeric sentiment labels to reviews based on polarity (e.g.,
Positive = 1, Negative = 0). The example "Horrible customer service. Would not recommend!" is
labeled as Negative (0), while "Great quality for the price." is Positive (1).
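
The sketch below illustrates steps 1-4 and 7 on the example reviews. It assumes NLTK and scikit-learn are installed; the function and variable names are illustrative choices, not part of any prescribed solution.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the NLTK resources used below
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

reviews = [
    "The product was fantastic! But delivery was delayed.",
    "Horrible customer service. Would not recommend!",
    "Great quality for the price.",
]

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Step 1: text cleaning - lowercase and drop non-letter characters
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Step 2: tokenization
    tokens = nltk.word_tokenize(text)
    # Steps 3 and 4: stopword removal and lemmatization
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]

cleaned = [" ".join(preprocess(r)) for r in reviews]
print(cleaned)  # e.g., first review -> "product fantastic delivery delayed"

# Step 7: text vectorization with TF-IDF
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(cleaned)
print(features.shape)  # (number of reviews, vocabulary size)
```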
Write a Python function to tokenize and remove stop words from the sentence "The
quick brown fox jumps over the lazy dog." Explain how this step affects the
quality of an NLP model.

Preprocessing a sentence by tokenization and stop word removal improves the quality of an NLP
model by focusing it on meaningful words and reducing noise. Stop words such as "the" and "is"
add little value toward understanding the context or intent of the text. Removing them lets the
model process only relevant data, which reduces dimensionality and improves computational
efficiency. It also increases the signal-to-noise ratio, allowing the model to learn patterns more
effectively. For instance, in the sentence "The quick brown fox jumps over the lazy dog," the core
content is retained in words such as "quick," "brown," and "jumps." These steps enhance
performance in tasks such as sentiment analysis, text classification, and topic modeling, all of
which benefit from clear and concise input.
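
A minimal version of the requested function, assuming NLTK is available (the resource downloads are one-time):

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

def tokenize_and_remove_stopwords(text):
    """Tokenize text and drop English stop words and punctuation."""
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    return [tok for tok in tokens if tok.isalpha() and tok not in stop_words]

print(tokenize_and_remove_stopwords("The quick brown fox jumps over the lazy dog."))
# Expected output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Note that "over" is also dropped here because it appears in NLTK's English stop word list; whether such words carry useful signal depends on the task.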
Compare and contrast traditional feature extraction techniques like TF-IDF
with modern embeddings like Word2Vec and BERT. Discuss the impact of
these advancements.

Comparison of Traditional and Modern Feature Extraction Techniques


Aspect            | TF-IDF (Traditional)                                        | Word2Vec / BERT (Modern)
Representation    | Sparse matrix based on word frequency.                      | Dense vector embeddings capturing word semantics.
Context Awareness | No understanding of word context.                           | Word2Vec considers local context; BERT considers full sentence context (bidirectional).
Dimensionality    | High-dimensional (one dimension per word in vocabulary).    | Low-dimensional (fixed-size vectors, e.g., 300 for Word2Vec).
Handling Synonyms | Treats synonyms as separate entities.                       | Captures semantic similarity, treating synonyms as similar.
Sentence Context  | Fails to capture word order or sentence meaning.            | Embeddings like BERT understand sentence meaning and structure.
Pre-training      | Requires manual feature engineering for every new dataset.  | Pre-trained on massive corpora, requiring only fine-tuning for specific tasks.
Performance       | Suitable for small datasets and simpler tasks.              | Excels in complex tasks requiring deeper understanding of language.

Impact of Advancements

Modern embeddings such as Word2Vec and BERT substantially raise the bar for NLP tasks
by providing far richer, contextually aware text representations. They can capture finer shades
of meaning, such as polysemy and context-dependent relations, which makes them valuable for
sentiment analysis, machine translation, and text classification. A major advantage of
pre-trained models is the time and computational resources they save: instead of being trained
from scratch, a model is simply fine-tuned for the task at hand.
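
To make the contrast concrete, a brief sketch of how each representation is typically obtained is shown below, assuming scikit-learn and the Hugging Face transformers library; "bert-base-uncased" is the standard public checkpoint, and the mean-pooling step is just one common way to obtain a sentence vector.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer, AutoModel

sentences = ["Great quality for the price.", "Horrible customer service."]

# Traditional: sparse TF-IDF vectors, one dimension per vocabulary word
tfidf = TfidfVectorizer().fit_transform(sentences)
print("TF-IDF shape:", tfidf.shape)  # (2, vocabulary size), sparse

# Modern: dense contextual embeddings from a pre-trained BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
batch = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (2, sequence length, 768)
sentence_vectors = hidden.mean(dim=1)          # simple mean pooling
print("BERT shape:", tuple(sentence_vectors.shape))  # (2, 768), dense
```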

You are tasked with classifying emails as spam or non-spam. Justify the choice
of feature extraction technique you would use and why.

The choice of feature extraction technique depends on the complexity of the task and the nature
of the dataset. TF-IDF is a strong candidate for a standard spam detection task, since it highlights
key distinguishing terms such as "free," "click here," or "win now," which characterize spam
emails. It is computationally efficient and simple, making it a good fit for smaller datasets; its
implementation requires no pre-training, so it is easy to apply and interpret.

However, when the email dataset involves nuance or context, such as subtle phishing emails,
BERT is a better choice. BERT captures the intent and semantic meaning of the text, going far
beyond raw word frequencies. Fine-tuning a pre-trained BERT model on email data can yield
substantial gains, especially in complex cases where the indications of spam depend on tone or
structure.

Therefore, while TF-IDF may work well enough for simple scenarios with obvious spam
keywords, BERT will perform much better on more complex, subtle spam classification problems.
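
A hedged sketch of the TF-IDF route is given below; the emails and labels are hypothetical toy data standing in for a real labelled corpus, and logistic regression is just one reasonable choice of classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in practice these come from a labelled email corpus
emails = [
    "Win a free prize, click here now!",
    "Meeting rescheduled to 3 pm tomorrow.",
    "Claim your reward, limited time offer!",
    "Please review the attached quarterly report.",
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = non-spam

# TF-IDF features feeding a simple linear classifier
spam_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
spam_clf.fit(emails, labels)

print(spam_clf.predict(["Free reward, click here!"]))  # likely [1] (spam)
```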

Write a Python script to train a Recurrent Neural Network (RNN) on the Shakespeare text dataset
available at this link. Follow the NLP pipeline to train the model and generate Shakespearean-style
text.

Recurrent Neural Networks (RNNs) are designed to handle sequential input, which makes them
well suited to text generation tasks. When applied to the Shakespeare text dataset, an RNN learns
patterns and structures by processing the text sequentially and predicting the next character or
word. Following the NLP pipeline, the dataset is preprocessed, tokenized, and then used to train
the model. Once trained, the RNN can generate Shakespearean text by predicting and appending
tokens sequentially, capturing the distinct style and flow of the original works. The link to the
model is provided below:

RNN MODEL LINK
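
Since the paragraph above only summarizes the approach, a condensed character-level RNN sketch in TensorFlow/Keras follows. The dataset URL is the publicly hosted Shakespeare text used in TensorFlow tutorials and is an assumption here, since the assignment's original link is not reproduced; the hyperparameters are illustrative.

```python
import numpy as np
import tensorflow as tf

# Assumed public mirror of the Shakespeare text (stands in for the assignment's link)
path = tf.keras.utils.get_file(
    "shakespeare.txt",
    "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt",
)
text = open(path, "rb").read().decode("utf-8")

# Character-level vocabulary and integer encoding
vocab = sorted(set(text))
char2idx = {c: i for i, c in enumerate(vocab)}
idx2char = np.array(vocab)
encoded = np.array([char2idx[c] for c in text])

# Build (input, target) pairs: predict the next character at every position
seq_len = 100
ds = tf.data.Dataset.from_tensor_slices(encoded)
ds = ds.batch(seq_len + 1, drop_remainder=True)
ds = ds.map(lambda s: (s[:-1], s[1:]))
ds = ds.shuffle(10000).batch(64, drop_remainder=True)

# Simple recurrent model: embedding -> GRU -> per-character logits
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(vocab), 256),
    tf.keras.layers.GRU(512, return_sequences=True),
    tf.keras.layers.Dense(len(vocab)),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(ds, epochs=5)  # more epochs give more convincing output

# Generate text character by character from a seed string
def generate(seed, length=300, temperature=1.0):
    ids = [char2idx[c] for c in seed]
    for _ in range(length):
        logits = model(np.array([ids]))[0, -1] / temperature
        next_id = tf.random.categorical(logits[None, :], num_samples=1)[0, 0].numpy()
        ids.append(int(next_id))
    return "".join(idx2char[i] for i in ids)

print(generate("ROMEO: "))
```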
