Unit 1: Python Text and NLP Basics
1.1 Introduction to Python Text Basics
Python offers numerous tools for handling and manipulating text. The most basic of these are
string operations, but for more advanced tasks, we use libraries like re for regular expressions
and Spacy for NLP-specific functions.
Basic String Operations
Operation | Example Code | Output
Lowercasing | text = "HELLO"; text.lower() | 'hello'
Splitting | text = "Hello, World!"; text.split(',') | ['Hello', ' World!']
Concatenation | a = "Hello"; b = "World"; c = a + " " + b | 'Hello World'
Replacing | text = "I am happy"; text.replace('happy', 'sad') | 'I am sad'
String operations in Python are efficient for simple text processing tasks, such as breaking a
sentence into words, converting text to lowercase, or replacing substrings.
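For instance, a short snippet chaining a few of these operations (the sample string is illustrative):
text = "Hello, World!"
print(text.lower())                  # 'hello, world!'
words = text.split(", ")             # ['Hello', 'World!']
print(" ".join(words))               # 'Hello World!'
print(text.replace("World", "NLP"))  # 'Hello, NLP!'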
File Handling in Python
Operation | Description | Example Code | Output
Reading a File | Reads the entire content of the file. | file.read() | Contents of file
Reading Lines | Reads the file line by line and returns a list. | file.readlines() | List of lines in file
Writing to a File | Writes data to the file (overwrites existing data). | file.write("Hello World") | -
Example:
with open('example.txt', 'r', encoding='utf-8') as file:
    content = file.read()
print(content)  # Output: contents of 'example.txt'
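The table above also lists writing to a file; a minimal sketch (the filename 'output.txt' is illustrative):
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write("Hello World")  # overwrites any existing content of 'output.txt'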
1.2 Working with PDFs
PDF Text Extraction
Python libraries such as PyPDF2 and pdfminer.six are commonly used to extract text from
PDF documents.
Library | Description | Example Code | Output
PyPDF2 | Simple library to extract PDF text. | text = pdf_reader.getPage(0).extractText() | Text from page 1 of PDF
pdfminer.six | More powerful for text extraction. | extract_text('file.pdf') | Complete text from PDF
Example:
import PyPDF2

with open('sample.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)  # In PyPDF2 3.x this class is PdfReader
    text = reader.getPage(0).extractText()   # In PyPDF2 3.x: reader.pages[0].extract_text()
print(text)  # Outputs text from the first page
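For comparison, a minimal pdfminer.six sketch for the same file (extract_text returns the text of every page at once):
from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)  # Full text of the PDF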
1.3 Introduction to Regular Expressions (Regex)
Regular expressions allow us to define complex patterns to search, match, or manipulate text.
Python’s re module provides a variety of functions to work with regex.
Regex Operation | Description | Example Code | Output
Finding Patterns | Find all occurrences of a pattern in text. | re.findall(r'\d+', 'User123 data') | ['123']
Substituting Patterns | Replace parts of the text that match a pattern. | re.sub(r'\d+', 'ID', 'User123') | 'UserID'
Shorthand Classes | Predefined classes for matching. | re.findall(r'\w+', 'Text 123') | ['Text', '123']
Character Ranges | Matches a range of characters. | re.findall(r'[A-Z]', 'Hello World') | ['H', 'W']
Common Regex Patterns
Pattern | Meaning | Example | Matches
\d | Any digit (0-9) | \d+ | "123" from "User123"
\w | Any word character (a-z, A-Z, 0-9, _) | \w+ | "User123"
\s | Any whitespace character | \s+ | " " (space)
[a-z] | Any lowercase letter | [a-z]+ | "ext" from "Text"
Example: Removing Special Characters
import re

text = "Hello! Welcome to NLP 101."
clean_text = re.sub(r'[^A-Za-z\s]', '', text)  # Removes anything that is not a letter or space
print(clean_text)  # Output: "Hello Welcome to NLP"
1.4 Preprocessing using Regex
Preprocessing is the crucial first step in any NLP pipeline, ensuring that the data is cleaned and
normalized before being fed into algorithms.
Preprocessing Task | Regex Pattern / Operation | Example | Output
Remove URLs | re.sub(r'http\S+', '', text) | "Visit http://example.com" | "Visit "
Remove Special Characters | re.sub(r'[^A-Za-z0-9\s]', '', text) | "Hello, World!" | "Hello World"
Extract Email Addresses | re.findall(r'\S+@\S+', text) | "Contact me at [email protected]" | ["[email protected]"]
Replace Digits | re.sub(r'\d+', 'NUM', text) | "My number is 12345" | "My number is NUM"
Example: Removing Digits
text = "The price is 123 dollars."
clean_text = re.sub(r'\d+', 'NUM', text)
print(clean_text) # Output: "The price is NUM dollars."
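In practice these steps are often combined into one cleaning function; a minimal sketch reusing the patterns from the table above (the function name clean is illustrative):
import re

def clean(text):
    text = re.sub(r'http\S+', '', text)         # remove URLs
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # remove special characters
    text = re.sub(r'\d+', 'NUM', text)          # replace digits
    text = re.sub(r'\s+', ' ', text)            # collapse repeated whitespace
    return text.lower().strip()

print(clean("Visit http://example.com, my number is 12345!"))
# Output: 'visit my number is num'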
1.5 Introduction to Natural Language Processing (NLP)
NLP involves enabling machines to understand, interpret, and generate human language. It
combines computer science, linguistics, and machine learning techniques.
Key Applications of NLP
Application | Description | Example
Chatbots | Automate customer service and support | Virtual assistants like Siri, Alexa
Sentiment Analysis | Classifying the sentiment of text (positive/negative) | Analyzing movie reviews
Machine Translation | Translating text between languages | Google Translate
Speech Recognition | Converting spoken language to text | Speech-to-text in Google Docs
Challenges in NLP
Challenge | Description | Example
Ambiguity | Words or sentences with multiple meanings. | "The bank is on the river bank." (bank as financial institution or riverbank)
Variety | Variations in language use across dialects, regions, etc. | British English vs. American English: colour vs. color
1.6 Role of Machine Learning in NLP
Modern NLP relies heavily on machine learning, particularly deep learning, to automatically
detect patterns in language. The following models are popular in NLP:
Model | Description | Example
Bag of Words (BoW) | Text represented as a bag of individual words, ignoring order. | 'I love NLP' → {'I': 1, 'love': 1, 'NLP': 1}
TF-IDF | Weighting scheme where frequent but less important words are down-weighted. | 'I love NLP' → weighted matrix
RNN (Recurrent Neural Network) | Models sequences and dependencies in text. | Used for machine translation or text generation
Transformers | Advanced model that captures global context across sentences. | Used in GPT, BERT for tasks like summarization
Bag of Words Example
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["I love NLP", "I love programming"])
print(X.toarray()) # Output: BoW matrix
1.7 Spacy Basics
Spacy is a popular NLP library in Python, known for its efficiency and ease of use. Key features
include tokenization, part-of-speech tagging, and named entity recognition.
Tokenization
Tokenization refers to splitting text into words or sentences.
Operation | Example Code | Output
Word Tokenization | tokens = [token.text for token in doc] | ['I', 'love', 'NLP']
Sentence Tokenization | sentences = list(doc.sents) | ['I love NLP.', 'It is amazing.']
Example: Tokenization
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is exciting!")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']
1.8 Stemming, Lemmatization, Stop Words
Operation | Description | Example Code | Output
Stemming | Reducing words to their root form. | stemmer.stem("running") | 'run'
Lemmatization | Converting words to their base dictionary form. | [token.lemma_ for token in doc] | ['run', 'be']
Stop Words | Common words that can be removed during processing. | [token for token in doc if not token.is_stop] | List of non-stop words
Example: Lemmatization
doc = nlp("The children are playing.")
lemmas = [token.lemma_ for token in doc]
print(lemmas) # Output: ['the', 'child', 'be', 'play', '.']
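The table also references a stemmer and stop-word removal; a minimal sketch using NLTK's PorterStemmer and the same spaCy doc (added for illustration, not from the original example):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: 'run'

doc = nlp("The children are playing.")
non_stop = [token.text for token in doc if not token.is_stop]
print(non_stop)  # Output: ['children', 'playing', '.']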
1.9 Phrase Matching and Vocabulary
Phrase matching is used to search for multi-word expressions in text, which are often significant
in NLP tasks like entity recognition or keyword extraction.
Example: Phrase Matching
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp(text) for text in ["machine learning", "natural language processing"]]
matcher.add("TechTerms", patterns)  # spaCy 3.x; in spaCy 2.x: matcher.add("TechTerms", None, *patterns)
doc = nlp("I love machine learning and natural language processing.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)  # Output: 'machine learning', 'natural language processing'
Unit 2: Part of Speech Tagging and Named Entity Recognition (NER)
2.1 Part of Speech Tagging (POS)
POS Tagging is the process of labeling each word in a sentence with its respective part of
speech, such as noun, verb, adjective, etc. POS tagging is a fundamental part of many NLP
tasks, including syntactic parsing and word-sense disambiguation.
POS Tagging in Spacy
● Spacy automatically assigns POS tags using its built-in model, which labels words with
their grammatical roles.
import spacy

nlp = spacy.load("en_core_web_sm")
sentence = "Apple is looking at buying a U.K. startup."
doc = nlp(sentence)
for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.tag_})")
Common POS Tags
POS Tag | Full Form | Example | Description
NOUN | Noun | startup | A person, place, thing, or idea
VERB | Verb | buying | Action or state of being
ADJ | Adjective | big | Describes a noun
PROPN | Proper Noun | U.K. | Specific names of people, places
ADV | Adverb | quickly | Modifies a verb, adjective, or adverb
AUX | Auxiliary Verb | is | Helps form different tenses
Example Output:
Apple -> PROPN (NNP)
is -> AUX (VBZ)
looking -> VERB (VBG)
at -> ADP (IN)
buying -> VERB (VBG)
a -> DET (DT)
U.K. -> PROPN (NNP)
startup -> NOUN (NN)
POS Tagging vs Named Entity Recognition
Feature | POS Tagging | Named Entity Recognition (NER)
Purpose | Labels words as nouns, verbs, etc. | Identifies proper nouns and classifies them (e.g., person, organization)
Example | Verb (run), Noun (book) | Person (John), Organization (Google), Location (Paris)
Use Cases | Syntactic parsing, understanding sentence structure | Identifying named entities in text for information extraction
2.2 Named Entity Recognition (NER)
Named Entity Recognition (NER) identifies and classifies named entities mentioned in the text
into predefined categories, such as persons, organizations, locations, dates, etc.
Example of NER in Spacy
doc = nlp("Apple is looking at buying a startup in the U.K.")
for ent in doc.ents:
    print(ent.text, ent.label_)
NER Labels and Their Meanings
Entity Label | Full Form | Example | Description
PERSON | Person | Elon Musk | Recognizes people's names
ORG | Organization | Apple, Google | Recognizes corporate organizations
GPE | Geopolitical Entity | U.K., Germany | Recognizes countries, cities, states
DATE | Date | July 2020 | Recognizes dates
MONEY | Monetary Value | $500 | Recognizes currency values
Example Output:
Apple -> ORG
U.K. -> GPE
Comparison of POS Tagging and NER
Feature | POS Tagging | Named Entity Recognition (NER)
Purpose | Assigns part-of-speech labels to tokens | Identifies and categorizes named entities
Example | Verb (run), Noun (city) | Person (Elon Musk), Organization (Apple), GPE (U.K.)
Applications | Language structure analysis | Information extraction, named entity categorization
2.3 Sentence Segmentation
Sentence segmentation is the process of splitting text into individual sentences. It is a critical
step in NLP for understanding sentence boundaries and structure.
Example of Sentence Segmentation in Spacy
text = "Hello! How are you? I'm doing well."
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
Example Output:
Hello!
How are you?
I'm doing well.
Techniques for Sentence Segmentation
Technique | Description | Example
Rule-based | Uses punctuation and specific markers (e.g., periods, question marks) to split sentences. | Split based on "." or "?"
ML-based | Uses machine learning models to learn sentence boundaries. | Models trained on annotated corpora to detect sentence ends.
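A minimal rule-based sketch using a regex split (this simple pattern breaks on abbreviations such as "U.K.", which is one reason trained segmenters like spaCy's are preferred):
import re

text = "Hello! How are you? I'm doing well."
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)  # Output: ['Hello!', 'How are you?', "I'm doing well."]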
2.4 Text Modeling using the Bag of Words Model
The Bag of Words (BoW) model represents text data as a collection of words, ignoring grammar
and word order but maintaining frequency counts of each word.
Bag of Words Example
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray()) # Outputs the BoW matrix
Bag of Words Matrix
Sentence | I | love | NLP | is | amazing | programming
"I love NLP" | 1 | 1 | 1 | 0 | 0 | 0
"NLP is amazing" | 0 | 0 | 1 | 1 | 1 | 0
"I love programming" | 1 | 1 | 0 | 0 | 0 | 1
Advantages and Limitations of Bag of Words
Advantages | Limitations
Simple and easy to implement | Ignores word order
Works well for simple text classification tasks | Does not capture semantic meaning of words
2.5 Text Modeling using the TF-IDF Model
TF-IDF (Term Frequency-Inverse Document Frequency) is an advanced text representation
model that weighs terms based on their frequency in a document and their inverse frequency in
the entire corpus. This reduces the weight of common terms like “the” and “is.”
TF-IDF Formula:
● Term Frequency (TF) = (number of occurrences of the word in the document) / (total number of words in the document)
● Inverse Document Frequency (IDF) = log(total number of documents / number of documents containing the word)
● TF-IDF = TF × IDF
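A small worked example under the formula above (the numbers are made up for illustration; scikit-learn's TfidfVectorizer uses a smoothed variant of IDF, so its values differ slightly):
TF("nlp")     = 3 / 100 = 0.03        (the word appears 3 times in a 100-word document)
IDF("nlp")    = log(1000 / 10) = 2    (10 of 1000 documents contain it, using log base 10)
TF-IDF("nlp") = 0.03 × 2 = 0.06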
TF-IDF Example in Python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print(X_tfidf.toarray()) # Outputs the TF-IDF matrix
Comparison of BoW and TF-IDF
Model | Description | Use Case
Bag of Words | Represents text as a collection of word frequencies. | Simple text classification tasks.
TF-IDF | Weighs words by frequency and importance in the corpus. | Better for tasks where word significance matters, like information retrieval.
2.6 Understanding the N-Gram Model
An N-Gram is a contiguous sequence of n items (words, characters, etc.) from a given text.
N-Grams capture local context by analyzing adjacent words or characters.
Types of N-Grams
N-Gram Type | Example
Unigram (n=1) | "I", "love", "NLP"
Bigram (n=2) | "I love", "love NLP"
Trigram (n=3) | "I love NLP", "love NLP courses"
Example: Generating Bigrams
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)
print(bigram_vectorizer.get_feature_names_out())  # Outputs list of bigrams
N-Gram Applications
● Unigrams: Often used in simple text classification tasks.
● Bigrams/Trigrams: Useful in language models where word context is important (e.g.,
machine translation, speech recognition).
Example N-Gram Usage:
Sentence: "I love NLP"
Unigrams: ["I", "love", "NLP"]
Bigrams: ["I love", "love NLP"]
2.7 Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA) is a technique in natural language processing that helps
discover the underlying structure of relationships between terms and documents. LSA reduces
the dimensionality of text data by transforming it into a lower-dimensional space using Singular
Value Decomposition (SVD). This technique is useful for text clustering, topic modeling, and
document similarity.
Steps in LSA:
1. Construct the Term-Document Matrix (using BoW or TF-IDF).
2. Apply Singular Value Decomposition (SVD) to decompose the matrix into three matrices:
U, Σ, and V.
3. Reduce the dimensionality by selecting the top k components from the decomposition.
Formula:
A = U Σ V^T
Where:
● A is the original matrix.
● U is the matrix representing terms.
● Σ is the diagonal matrix representing the singular values.
● V^T is the matrix representing documents.
Example: Applying LSA in Python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog barked at the mailman.",
          "The cat meowed at the dog.",
          "The mailman ran from the dog."]

# Convert corpus into a TF-IDF matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Perform SVD (LSA)
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)

# Output the LSA-reduced matrix
print(X_lsa)
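To interpret the reduced dimensions as topics, you can inspect which terms weigh most heavily on each SVD component; a minimal sketch continuing the code above (added for illustration):
import numpy as np

terms = tfidf.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top_terms = [terms[j] for j in np.argsort(component)[::-1][:3]]
    print(f"Topic {i}: {top_terms}")  # Top 3 terms for each latent component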
LSA Applications:
● Topic Modeling: Identifying the underlying topics in a collection of documents.
● Information Retrieval: Improving search engine performance by finding documents with
similar meanings.
2.8 Word Synonyms and Antonyms using NLTK
In NLP, synonyms are words that have similar meanings, while antonyms are words with
opposite meanings. The NLTK (Natural Language Toolkit) provides a built-in lexical database
called WordNet to fetch synonyms and antonyms for any word.
Example: Finding Synonyms and Antonyms with NLTK
from nltk.corpus import wordnet

# Synonyms for "happy"
synonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

# Antonyms for "happy"
antonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(f"Synonyms: {set(synonyms)}")
print(f"Antonyms: {set(antonyms)}")
Example Output:
Synonyms: {'felicitous', 'glad', 'happy'}
Antonyms: {'unhappy'}
Applications of Synonyms and Antonyms in NLP:
● Thesaurus generation.
● Word-sense disambiguation.
● Improving semantic search.
2.9 Word Negation Tracking
Word Negation Tracking refers to identifying and understanding negation in a sentence. Words
like "not", "never", "no", or "none" can drastically change the meaning of a sentence. Handling
negations is crucial for tasks like sentiment analysis or intent recognition.
Example: Negation Handling
import nltk
from nltk.tokenize import word_tokenize

def negate_sentence(sentence):
    tokens = word_tokenize(sentence)
    negation = False
    result = []
    for token in tokens:
        if token.lower() in ["not", "never", "no"]:
            negation = True          # start negating the tokens that follow
            result.append(token)
        elif token in [".", "!", "?"]:
            negation = False         # sentence-ending punctuation stops the negation scope
            result.append(token)
        else:
            result.append("NOT_" + token if negation else token)
    return " ".join(result)

# Example sentence
sentence = "I am not happy with the service."
negated_sentence = negate_sentence(sentence)
print(negated_sentence)  # Output: 'I am not NOT_happy NOT_with NOT_the NOT_service .'
Applications of Negation Tracking:
● Sentiment Analysis: Identifying positive and negative opinions more accurately.
● Intent Recognition: Understanding when users are making negative statements.
Unit 3: Text Classification and Text Summarization
3.1 Text Classification
Text classification is the process of assigning labels or categories to a piece of text based on its
content. This is widely used in tasks like sentiment analysis, spam detection, and topic
classification.
Steps in Text Classification:
1. Get the Data: Collect or import the dataset.
2. Data Preprocessing: Clean and preprocess the text (remove punctuation, stop words,
etc.).
3. Transform into BoW/TF-IDF Model: Convert the text into a vector representation.
4. Train the Model: Use classification algorithms like Logistic Regression, SVM, Naive
Bayes, etc.
5. Test the Model: Evaluate the model's performance using metrics like accuracy,
precision, recall.
Example: Text Classification with Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
corpus = ["I love this product!", "This is the worst experience!",
          "Absolutely fantastic service!", "Terrible customer support."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Preprocessing: TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Train Logistic Regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Common Text Classification Algorithms:
Algorithm | Description | Example Use Case
Logistic Regression | A simple linear model for binary classification. | Spam vs non-spam
Naive Bayes | A probabilistic classifier based on Bayes' theorem. | Sentiment analysis
Support Vector Machines | A robust linear classifier that finds the decision boundary. | Fake news detection
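Any of these classifiers can be swapped into the pipeline above; a minimal sketch using Multinomial Naive Bayes on the same TF-IDF features (an alternative added for illustration, not part of the original example):
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)  # reuses the TF-IDF train/test split from above
print(f"Naive Bayes accuracy: {accuracy_score(y_test, nb_classifier.predict(X_test))}")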
3.2 Text Summarization
Text summarization is the process of reducing the length of a document while preserving its key
information. There are two main types of text summarization: Extractive Summarization and
Abstractive Summarization.
● Extractive Summarization: Selects key sentences or phrases directly from the original
text.
● Abstractive Summarization: Generates new sentences that summarize the content
(like how humans summarize text).
Steps in Extractive Summarization:
1. Fetch and Preprocess the Data: Get the document and clean it.
2. Tokenization: Split the text into sentences.
3. Build a Histogram: Calculate the frequency of each word.
4. Calculate Sentence Scores: Score each sentence based on the significance of its
words.
5. Select Sentences: Choose top N sentences for the summary.
Example: Extractive Summarization using NLTK
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

# Sample text
text = """Natural Language Processing is an exciting field of Artificial Intelligence.
It enables machines to understand and process human language.
It is widely used in chatbots, language translation, and many other applications."""

# Step 1: Tokenize sentences
sentences = sent_tokenize(text)

# Step 2: Preprocess words and build word frequencies
stop_words = set(stopwords.words('english'))
word_frequencies = defaultdict(int)
for word in word_tokenize(text):
    if word.lower() not in stop_words and word.isalpha():
        word_frequencies[word.lower()] += 1

# Step 3: Calculate sentence scores
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence):
        if word.lower() in word_frequencies:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word.lower()]
            else:
                sentence_scores[sentence] += word_frequencies[word.lower()]

# Step 4: Select top sentences for summary
summary = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:2]
print("Summary:", " ".join(summary))
Applications of Text Summarization:
● News Summarization: Condensing lengthy news articles.
● Research Papers: Generating brief abstracts for long papers.
● Automatic Meeting Notes: Summarizing meeting transcripts for key points.
Unit 4: Semantics and Sentiment Analysis
4.1 Introduction to Semantics and Sentiment Analysis
● Semantics in NLP deals with the meaning and interpretation of words, phrases,
sentences, and larger units of text. It helps understand context, disambiguate word
meanings, and identify relationships between entities.
● Sentiment Analysis focuses on determining the emotional tone behind a body of text,
identifying whether it is positive, negative, or neutral.
4.2 Semantics and Word Vectors
Word vectors (also known as word embeddings) are numerical representations of words in a
high-dimensional space. Words that share similar contexts in a corpus tend to be closer in this
vector space. Word vectors enable semantic analysis by capturing relationships like:
● Synonymy: Words with similar meanings.
● Analogy: Word relationships (e.g., "king" is to "queen" as "man" is to "woman").
Common Word Embedding Models:
● Word2Vec: Converts words into dense vectors by training on large corpora using two
architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
● GloVe (Global Vectors for Word Representation): Generates word embeddings by
factorizing word co-occurrence matrices.
● FastText: Builds on Word2Vec, but models sub-word information, making it better for
handling rare words.
Word2Vec Example:
import gensim
from gensim.models import Word2Vec

# Example corpus
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"], ["I", "enjoy", "learning", "NLP"]]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get word vector for "NLP"
print(model.wv['NLP'])

# Finding most similar words to "NLP"
print(model.wv.most_similar('NLP'))
Example Output (Most Similar Words to 'NLP'):
[('love', 0.832), ('learning', 0.810), ('fun', 0.750)]
Analogy Example:
# Analogy: "man" is to "king" as "woman" is to ? (expected answer: "queen")
# Note: this requires a model trained on a large corpus; the toy corpus above does not contain these words.
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
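A minimal FastText sketch on the same toy corpus (gensim's FastText shares the Word2Vec interface; because it models character n-grams, it can build vectors even for words it has never seen):
from gensim.models import FastText

ft_model = FastText(sentences, vector_size=100, window=5, min_count=1)
print(ft_model.wv['NLP'])   # vector for a known word
print(ft_model.wv['NLPs'])  # vector assembled from sub-word n-grams, even though 'NLPs' was never seen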
4.3 Sentiment Analysis Overview
Sentiment Analysis is the task of analyzing a piece of text to determine the underlying
sentiment or opinion. It can classify text as positive, negative, or neutral. Sentiment analysis is
widely used in:
● Product reviews: Identifying customer opinions on products.
● Social media: Analyzing user feedback and reactions.
● Customer service: Gauging user satisfaction from responses.
Common Approaches for Sentiment Analysis:
Method | Description | Example
Rule-Based | Uses predefined lists of positive/negative words and rules | Lexicon-based (SentiWordNet)
Machine Learning | Classifies sentiment using machine learning algorithms | Logistic Regression, Naive Bayes
Deep Learning | Uses neural networks to automatically learn sentiment patterns | LSTM, RNNs, Transformers
Sentiment Score Example (Lexicon-Based):
Each word is assigned a sentiment score based on its polarity (positive or negative).
● "I love this product!" → Positive sentiment (words like "love" have positive polarity).
● "The service was terrible." → Negative sentiment (words like "terrible" have negative
polarity).
4.4 Sentiment Analysis with NLTK
The Natural Language Toolkit (NLTK) provides tools for simple sentiment analysis. The
VADER (Valence Aware Dictionary for Sentiment Reasoning) model, built into NLTK, is
commonly used for lexicon-based sentiment analysis.
Example: Sentiment Analysis using NLTK's VADER
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Requires the VADER lexicon: nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Example sentence
sentence = "I love this product, but the delivery was terrible."

# Calculate sentiment scores
sentiment_scores = analyzer.polarity_scores(sentence)
print(sentiment_scores)  # Outputs a dictionary of scores
Example Output:
{'neg': 0.344, 'neu': 0.493, 'pos': 0.163, 'compound': -0.1531}
● neg: Negative sentiment score
● neu: Neutral sentiment score
● pos: Positive sentiment score
● compound: Overall sentiment (ranges from -1 to +1, where -1 is very negative and +1 is
very positive)
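The compound score is usually mapped to a label with fixed cutoffs; a minimal sketch using the thresholds commonly recommended for VADER (±0.05; the exact cutoffs are a judgment call):
def label_sentiment(compound):
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment(sentiment_scores['compound']))  # 'negative' for the example sentence above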
4.5 Sentiment Analysis Movie Review Project
Objective: Classify movie reviews as positive or negative.
Steps for the Sentiment Analysis Project:
1. Data Collection: Use the IMDb movie review dataset, which contains reviews labeled as
positive or negative.
2. Preprocessing: Clean the text by removing stop words, punctuation, and performing
tokenization.
3. Feature Extraction: Convert text into numerical format using Bag of Words or TF-IDF.
4. Training the Model: Train a machine learning model such as Logistic Regression or
Naive Bayes.
5. Evaluating the Model: Evaluate the model using metrics such as accuracy, precision,
recall, and F1 score.
Example Project Pipeline:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample movie review data
data = pd.read_csv('IMDB_Dataset.csv')
X = data['review']
y = data['sentiment'].map({'positive': 1, 'negative': 0})

# Preprocessing: Vectorize text using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = tfidf.fit_transform(X)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train the Logistic Regression model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
4.6 Twitter Sentiment Analysis
Twitter Sentiment Analysis focuses on analyzing the sentiment of tweets in real-time. Since
tweets are brief and often contain informal language, they pose unique challenges for NLP. This
project involves fetching live tweets, processing them, and predicting their sentiment.
Steps for Twitter Sentiment Analysis:
1. Set up the Twitter Application: Create a Twitter developer account and get access
tokens and API keys.
2. Fetch Real-Time Tweets: Use the tweepy library to fetch tweets based on specific
hashtags or keywords.
3. Preprocessing the Tweets: Clean tweets by removing URLs, hashtags, mentions, and
special characters.
4. Predicting Sentiment: Load a pre-trained sentiment analysis model (e.g., TF-IDF and
Logistic Regression) to classify the sentiment of each tweet.
5. Visualizing Results: Plot the distribution of sentiments (positive, negative, neutral).
Example: Fetching Tweets with Tweepy and Sentiment Analysis
import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Set up Tweepy API with your credentials
auth = tweepy.OAuthHandler('API_KEY', 'API_SECRET_KEY')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Fetch tweets based on a keyword
# (in Tweepy 4.x this method is named api.search_tweets)
tweets = api.search(q="product review", lang="en", count=100)

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Analyze sentiment of each tweet
for tweet in tweets:
    text = tweet.text
    sentiment = analyzer.polarity_scores(text)
    print(f"Tweet: {text} | Sentiment: {sentiment['compound']}")
Applications of Twitter Sentiment Analysis:
● Brand monitoring: Understanding public sentiment about a product or service.
● Political sentiment analysis: Gauging public opinion on political issues.
● Market research: Analyzing customer feedback to improve products.
Visualizing Twitter Sentiment Results
Once the tweets are processed and classified, you can plot the sentiment distribution using
matplotlib or seaborn to visualize whether the majority of tweets are positive, negative, or
neutral.
Key Concepts Recap and Differences
Concept | Definition | Example | Tools/Techniques
Word Vectors | High-dimensional numeric representations of words. | Word2Vec, GloVe | Word2Vec, FastText
Sentiment Analysis | Classifies text based on emotional tone. | Positive or negative movie reviews | NLTK (VADER), Machine Learning (Logistic Regression)
Named Entity Recognition | Identifies and classifies proper nouns. | Recognizing people, locations, dates in a sentence | SpaCy, NLTK
Text Summarization | Condenses text into shorter, meaningful summaries. | Summarizing an article into 3-4 key sentences | TF-IDF, Word Frequencies
Unit 4: Semantics and Sentiment Analysis - Continued
4.7 Sentiment Analysis Movie Review Project (Expanded)
Objective: The goal of this project is to classify movie reviews as either positive or negative
using machine learning techniques. In this section, we'll break down the project pipeline into
detailed steps with relevant code, insights, and explanations.
Steps for Movie Review Sentiment Classification:
1. Data Collection:
○ We'll use the IMDb movie review dataset, a commonly used dataset for
sentiment analysis.
○ This dataset contains reviews labeled as positive or negative, which helps train
a classification model.
import pandas as pd
# Load IMDb dataset (CSV file with reviews and sentiment labels)
data = pd.read_csv('IMDB_Dataset.csv')
# Inspect the first few rows of the data
print(data.head())
2. Sample Data:
Review | Sentiment
"I loved this movie! The acting was amazing and the story was gripping." | Positive
"This was the worst movie I have ever seen. I regret watching it." | Negative
"An absolute masterpiece with brilliant performances by the entire cast." | Positive
"Terrible plot, bad acting, and a complete waste of time. Avoid this movie at all costs." | Negative
3. Preprocessing:
○ Before training the model, we need to clean the data:
■ Lowercasing: Convert all text to lowercase to ensure uniformity.
■ Removing Punctuation: Strip out punctuation marks that don’t carry
meaning.
■ Tokenization: Split text into individual words or tokens.
■ Stop Word Removal: Remove common words like "the", "is", "in", which
don’t contribute to sentiment.
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize

# Preprocess function to clean and tokenize text
def preprocess_text(text):
    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to the dataset
data['cleaned_review'] = data['review'].apply(preprocess_text)
4. Example of Preprocessing:
   Original review: "I loved this movie! The acting was amazing and the story was gripping."
   Preprocessed: "loved movie acting amazing story gripping"
5. Vectorization (Converting Text to Numerical Form):
○ We’ll use the TF-IDF (Term Frequency-Inverse Document Frequency) model
to convert the text into a numerical format that machine learning algorithms can
work with.
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text data into TF-IDF vectors
tfidf = TfidfVectorizer(max_features=5000)  # Limit the vocabulary to 5000 words
X = tfidf.fit_transform(data['cleaned_review'])
y = data['sentiment'].map({'positive': 1, 'negative': 0})  # Map 'positive' to 1 and 'negative' to 0
6. Splitting the Data:
○ We divide the dataset into a training set (to train the model) and a test set (to
evaluate its performance).
from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
7. Training the Model:
○ We’ll use Logistic Regression as our classification model. Logistic regression is
well-suited for binary classification tasks.
from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
8. Testing and Evaluating the Model:
○ After training, we’ll evaluate the model using metrics like accuracy, precision,
recall, and F1-score.
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
Output Example:
Accuracy: 0.87
              precision    recall  f1-score   support

    Negative       0.88      0.86      0.87       245
    Positive       0.86      0.88      0.87       255

    accuracy                           0.87       500
4.8 Twitter Sentiment Analysis (Expanded)
Objective: Analyze the sentiment of real-time tweets on a specific topic or hashtag using the
Twitter API and classify them as positive, negative, or neutral.
Steps for Twitter Sentiment Analysis:
1. Setting Up the Twitter API:
○ First, create a Twitter Developer Account and get access tokens and API keys.
○ Use the Tweepy library to authenticate and fetch tweets.
import tweepy
# Replace these with your own API keys from Twitter Developer account
API_KEY = 'your_api_key'
API_SECRET = 'your_api_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_SECRET = 'your_access_secret'
# Authenticate using Tweepy
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)
2. Fetching Real-Time Tweets:
○ We can fetch tweets based on specific hashtags or keywords.
# Fetch tweets based on a hashtag
keyword = "#AI"
tweets = api.search(q=keyword, lang="en", count=100)  # api.search_tweets in Tweepy 4.x

# Print the text of the first 5 tweets
for tweet in tweets[:5]:
    print(tweet.text)
3. Preprocessing Tweets:
○ Just like the movie review dataset, we preprocess tweets to remove unnecessary
characters such as URLs, hashtags, mentions, and punctuation.
import re

def preprocess_tweet(text):
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"#\w+", "", text)     # Remove hashtags
    text = re.sub(r"@\w+", "", text)     # Remove mentions
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.lower()

# Preprocess tweets
cleaned_tweets = [preprocess_tweet(tweet.text) for tweet in tweets]
4. Sentiment Analysis of Tweets:
○ We’ll use the VADER sentiment analyzer from NLTK to classify the sentiment of
each tweet as positive, negative, or neutral.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Analyze the sentiment of each tweet
for tweet in cleaned_tweets:
    sentiment = sid.polarity_scores(tweet)
    print(f"Tweet: {tweet} | Sentiment: {sentiment['compound']}")
5. Visualizing Sentiment Distribution:
○ You can use matplotlib to visualize the distribution of sentiments (positive,
negative, neutral) across the tweets.
import matplotlib.pyplot as plt

# Count the number of positive, neutral, and negative sentiments
sentiment_counts = {'positive': 0, 'neutral': 0, 'negative': 0}
for tweet in cleaned_tweets:
    sentiment = sid.polarity_scores(tweet)
    if sentiment['compound'] >= 0.05:
        sentiment_counts['positive'] += 1
    elif sentiment['compound'] <= -0.05:
        sentiment_counts['negative'] += 1
    else:
        sentiment_counts['neutral'] += 1

# Plot the sentiment distribution
labels = ['Positive', 'Neutral', 'Negative']
sizes = [sentiment_counts['positive'], sentiment_counts['neutral'], sentiment_counts['negative']]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=['green', 'gray', 'red'])
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is drawn as a circle.
plt.title(f"Sentiment Distribution for {keyword} Tweets")
plt.show()
6. Applications of Twitter Sentiment Analysis:
● Brand Monitoring: Tracking how users perceive a brand or product based on real-time
feedback on social media.
● Political Sentiment: Analyzing public opinion on political issues or candidates.