Unit 1: Python Text and NLP Basics
1.1 Introduction to Python Text Basics
Python offers numerous tools for handling and manipulating text. The most basic of these are
string operations, but for more advanced tasks, we use libraries like re for regular expressions
and Spacy for NLP-specific functions.
Basic String Operations
Operation | Example Code | Output
Lowercasing | text = "HELLO"; text.lower() | 'hello'
Splitting | text = "Hello, World!"; text.split(',') | ['Hello', ' World!']
Concatenation | a = "Hello"; b = "World"; c = a + " " + b | 'Hello World'
Replacing | text = "I am happy"; text.replace('happy', 'sad') | 'I am sad'
String operations in Python are efficient for simple text processing tasks, such as breaking a
sentence into words, converting text to lowercase, or replacing substrings.
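For instance, a short snippet chaining a few of these operations (the sample string is illustrative):
text = "Hello, World!"
print(text.lower())                  # 'hello, world!'
words = text.split(", ")             # ['Hello', 'World!']
print(" ".join(words))               # 'Hello World!'
print(text.replace("World", "NLP"))  # 'Hello, NLP!'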
File Handling in Python
Operation | Description | Example Code | Output
Reading a File | Reads the entire content of the file. | file.read() | Contents of file
Reading Lines | Reads the file line by line and returns a list. | file.readlines() | List of lines in file
Writing to a File | Writes data to the file (overwrites existing data). | file.write("Hello World") | -
Example:
with open('example.txt', 'r', encoding='utf-8') as file:
    content = file.read()
print(content)  # Output: contents of 'example.txt'
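The table above also lists writing to a file; a minimal sketch (the filename 'output.txt' is illustrative):
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write("Hello World")  # overwrites any existing content of 'output.txt'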
1.2 Working with PDFs
PDF Text Extraction
Python libraries such as PyPDF2 and pdfminer.six are commonly used to extract text from
PDF documents.
Library | Description | Example Code | Output
PyPDF2 | Simple library to extract PDF text. | text = pdf_reader.getPage(0).extractText() | Text from page 1 of PDF
pdfminer.six | More powerful for text extraction. | extract_text('file.pdf') | Complete text from PDF
Example:
import PyPDF2

with open('sample.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfFileReader(pdf_file)  # In PyPDF2 3.x this class is PdfReader
    text = reader.getPage(0).extractText()   # In PyPDF2 3.x: reader.pages[0].extract_text()
print(text)  # Outputs text from the first page
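For comparison, a minimal pdfminer.six sketch for the same file (extract_text returns the text of every page at once):
from pdfminer.high_level import extract_text

text = extract_text('sample.pdf')
print(text)  # Full text of the PDF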
1.3 Introduction to Regular Expressions (Regex)
Regular expressions allow us to define complex patterns to search, match, or manipulate text.
Python’s re module provides a variety of functions to work with regex.
Regex Operation | Description | Example Code | Output
Finding Patterns | Find all occurrences of a pattern in text. | re.findall(r'\d+', 'User123 data') | ['123']
Substituting Patterns | Replace parts of the text that match a pattern. | re.sub(r'\d+', 'ID', 'User123') | 'UserID'
Shorthand Classes | Predefined classes for matching. | re.findall(r'\w+', 'Text 123') | ['Text', '123']
Character Ranges | Matches a range of characters. | re.findall(r'[A-Z]', 'Hello World') | ['H', 'W']
Common Regex Patterns
Pattern | Meaning | Example | Matches
\d | Any digit (0-9) | \d+ | "123" from "User123"
\w | Any word character (a-z, A-Z, 0-9, _) | \w+ | "User123"
\s | Any whitespace character | \s+ | " " (space)
[a-z] | Any lowercase letter | [a-z]+ | "ext" from "Text"
Example: Removing Special Characters
import re

text = "Hello! Welcome to NLP 101."
clean_text = re.sub(r'[^A-Za-z\s]', '', text)  # Removes anything that is not a letter or space
print(clean_text)  # Output: "Hello Welcome to NLP"
1.4 Preprocessing using Regex
Preprocessing is the crucial first step in any NLP pipeline, ensuring that the data is cleaned and
normalized before being fed into algorithms.
Preprocessing Task | Regex Pattern / Operation | Example | Output
Remove URLs | re.sub(r'http\S+', '', text) | "Visit http://example.com" | "Visit "
Remove Special Characters | re.sub(r'[^A-Za-z0-9\s]', '', text) | "Hello, World!" | "Hello World"
Extract Email Addresses | re.findall(r'\S+@\S+', text) | "Contact me at [email protected]" | ["[email protected]"]
Replace Digits | re.sub(r'\d+', 'NUM', text) | "My number is 12345" | "My number is NUM"
Example: Removing Digits
text = "The price is 123 dollars."
clean_text = re.sub(r'\d+', 'NUM', text)
print(clean_text) # Output: "The price is NUM dollars."
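In practice these steps are often combined into one cleaning function; a minimal sketch reusing the patterns from the table above (the function name clean is illustrative):
import re

def clean(text):
    text = re.sub(r'http\S+', '', text)         # remove URLs
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # remove special characters
    text = re.sub(r'\d+', 'NUM', text)          # replace digits
    text = re.sub(r'\s+', ' ', text)            # collapse repeated whitespace
    return text.lower().strip()

print(clean("Visit http://example.com, my number is 12345!"))
# Output: 'visit my number is num'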
1.5 Introduction to Natural Language Processing (NLP)
NLP involves enabling machines to understand, interpret, and generate human language. It
combines computer science, linguistics, and machine learning techniques.
Key Applications of NLP
Application | Description | Example
Chatbots | Automate customer service and support | Virtual assistants like Siri, Alexa
Sentiment Analysis | Classifying the sentiment of text (positive/negative) | Analyzing movie reviews
Machine Translation | Translating text between languages | Google Translate
Speech Recognition | Converting spoken language to text | Speech-to-text in Google Docs
Challenges in NLP
Challenge | Description | Example
Ambiguity | Words or sentences with multiple meanings. | "The bank is on the river bank." (bank as financial institution or riverbank)
Variety | Variations in language use across dialects, regions, etc. | British English vs. American English: colour vs. color
1.6 Role of Machine Learning in NLP
Modern NLP relies heavily on machine learning, particularly deep learning, to automatically
detect patterns in language. The following models are popular in NLP:
Model | Description | Example
Bag of Words (BoW) | Text represented as a bag of individual words, ignoring order. | 'I love NLP' → {'I': 1, 'love': 1, 'NLP': 1}
TF-IDF | Weighting scheme where frequent but less important words are down-weighted. | 'I love NLP' → weighted matrix
RNN (Recurrent Neural Network) | Models sequences and dependencies in text. | Used for machine translation or text generation
Transformers | Advanced model that captures global context across sentences. | Used in GPT, BERT for tasks like summarization
Bag of Words Example
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["I love NLP", "I love programming"])
print(X.toarray()) # Output: BoW matrix
1.7 Spacy Basics
Spacy is a popular NLP library in Python, known for its efficiency and ease of use. Key features
include tokenization, part-of-speech tagging, and named entity recognition.
Tokenization
Tokenization refers to splitting text into words or sentences.
Operation | Example Code | Output
Word Tokenization | tokens = [token.text for token in doc] | ['I', 'love', 'NLP']
Sentence Tokenization | sentences = list(doc.sents) | ['I love NLP.', 'It is amazing.']
Example: Tokenization
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is exciting!")
tokens = [token.text for token in doc]
print(tokens)  # Output: ['Natural', 'Language', 'Processing', 'is', 'exciting', '!']
1.8 Stemming, Lemmatization, Stop Words
Operation | Description | Example Code | Output
Stemming | Reducing words to their root form. | stemmer.stem("running") | 'run'
Lemmatization | Converting words to their base dictionary form. | [token.lemma_ for token in doc] | ['run', 'be']
Stop Words | Common words that can be removed during processing. | [token for token in doc if not token.is_stop] | List of non-stop words
Example: Lemmatization
doc = nlp("The children are playing.")
lemmas = [token.lemma_ for token in doc]
print(lemmas) # Output: ['the', 'child', 'be', 'play', '.']
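The table also references a stemmer and stop-word removal; a minimal sketch using NLTK's PorterStemmer and the same spaCy doc (added for illustration, not from the original example):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # Output: 'run'

doc = nlp("The children are playing.")
non_stop = [token.text for token in doc if not token.is_stop]
print(non_stop)  # Output: ['children', 'playing', '.']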
1.9 Phrase Matching and Vocabulary
Phrase matching is used to search for multi-word expressions in text, which are often significant
in NLP tasks like entity recognition or keyword extraction.
Example: Phrase Matching
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp(text) for text in ["machine learning", "natural language processing"]]
matcher.add("TechTerms", patterns)  # spaCy 3.x; in spaCy 2.x: matcher.add("TechTerms", None, *patterns)
doc = nlp("I love machine learning and natural language processing.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)  # Output: 'machine learning', 'natural language processing'
Unit 2: Part of Speech Tagging and Named Entity Recognition (NER)
2.1 Part of Speech Tagging (POS)
POS Tagging is the process of labeling each word in a sentence with its respective part of
speech, such as noun, verb, adjective, etc. POS tagging is a fundamental part of many NLP
tasks, including syntactic parsing and word-sense disambiguation.
POS Tagging in Spacy
● Spacy automatically assigns POS tags using its built-in model, which labels words with
their grammatical roles.
import spacy

nlp = spacy.load("en_core_web_sm")
sentence = "Apple is looking at buying a U.K. startup."
doc = nlp(sentence)
for token in doc:
    print(f"{token.text} -> {token.pos_} ({token.tag_})")
Common POS Tags
POS Tag | Full Form | Example | Description
NOUN | Noun | startup | A person, place, thing, or idea
VERB | Verb | buying | Action or state of being
ADJ | Adjective | big | Describes a noun
PROPN | Proper Noun | U.K. | Specific names of people, places
ADV | Adverb | quickly | Modifies a verb, adjective, or adverb
AUX | Auxiliary Verb | is | Helps form different tenses
Example Output:
Apple -> PROPN (NNP)
is -> AUX (VBZ)
looking -> VERB (VBG)
at -> ADP (IN)
buying -> VERB (VBG)
a -> DET (DT)
U.K. -> PROPN (NNP)
startup -> NOUN (NN)
POS Tagging vs Named Entity Recognition
Feature | POS Tagging | Named Entity Recognition (NER)
Purpose | Labels words as nouns, verbs, etc. | Identifies proper nouns and classifies them (e.g., person, organization)
Example | Verb (run), Noun (book) | Person (John), Organization (Google), Location (Paris)
Use Cases | Syntactic parsing, understanding sentence structure | Identifying named entities in text for information extraction
2.2 Named Entity Recognition (NER)
Named Entity Recognition (NER) identifies and classifies named entities mentioned in the text
into predefined categories, such as persons, organizations, locations, dates, etc.
Example of NER in Spacy
doc = nlp("Apple is looking at buying a startup in the U.K.")
for ent in doc.ents:
    print(ent.text, ent.label_)
NER Labels and Their Meanings
Entity Label | Full Form | Example | Description
PERSON | Person | Elon Musk | Recognizes people's names
ORG | Organization | Apple, Google | Recognizes corporate organizations
GPE | Geopolitical Entity | U.K., Germany | Recognizes countries, cities, states
DATE | Date | July 2020 | Recognizes dates
MONEY | Monetary Value | $500 | Recognizes currency values
Example Output:
Apple -> ORG
U.K. -> GPE
Comparison of POS Tagging and NER
Feature | POS Tagging | Named Entity Recognition (NER)
Purpose | Assigns part-of-speech labels to tokens | Identifies and categorizes named entities
Example | Verb (run), Noun (city) | Person (Elon Musk), Organization (Apple), GPE (U.K.)
Applications | Language structure analysis | Information extraction, named entity categorization
2.3 Sentence Segmentation
Sentence segmentation is the process of splitting text into individual sentences. It is a critical
step in NLP for understanding sentence boundaries and structure.
Example of Sentence Segmentation in Spacy
text = "Hello! How are you? I'm doing well."
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
Example Output:
Hello!
How are you?
I'm doing well.
Techniques for Sentence Segmentation
Technique | Description | Example
Rule-based | Uses punctuation and specific markers (e.g., periods, question marks) to split sentences. | Split based on "." or "?"
ML-based | Uses machine learning models to learn sentence boundaries. | Models trained on annotated corpora to detect sentence ends.
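A minimal rule-based sketch using a regex split (this simple pattern breaks on abbreviations such as "U.K.", which is one reason trained segmenters like spaCy's are preferred):
import re

text = "Hello! How are you? I'm doing well."
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)  # Output: ['Hello!', 'How are you?', "I'm doing well."]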
2.4 Text Modeling using the Bag of Words Model
The Bag of Words (BoW) model represents text data as a collection of words, ignoring grammar
and word order but maintaining frequency counts of each word.
Bag of Words Example
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["I love NLP", "NLP is amazing", "I love programming"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray()) # Outputs the BoW matrix
Bag of Words Matrix
Sentence | I | love | NLP | is | amazing | programming
"I love NLP" | 1 | 1 | 1 | 0 | 0 | 0
"NLP is amazing" | 0 | 0 | 1 | 1 | 1 | 0
"I love programming" | 1 | 1 | 0 | 0 | 0 | 1
Advantages and Limitations of Bag of Words
Advantages | Limitations
Simple and easy to implement | Ignores word order
Works well for simple text classification tasks | Does not capture semantic meaning of words
2.5 Text Modeling using the TF-IDF Model
TF-IDF (Term Frequency-Inverse Document Frequency) is an advanced text representation
model that weighs terms based on their frequency in a document and their inverse frequency in
the entire corpus. This reduces the weight of common terms like “the” and “is.”
TF-IDF Formula:
● Term Frequency (TF) = (number of occurrences of the word in the document) / (total number of words in the document)
● Inverse Document Frequency (IDF) = log(total number of documents / number of documents containing the word)
● TF-IDF = TF × IDF
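A small worked example under the formula above (the numbers are made up for illustration; scikit-learn's TfidfVectorizer uses a smoothed variant of IDF, so its values differ slightly):
TF("nlp")     = 3 / 100 = 0.03        (the word appears 3 times in a 100-word document)
IDF("nlp")    = log(1000 / 10) = 2    (10 of 1000 documents contain it, using log base 10)
TF-IDF("nlp") = 0.03 × 2 = 0.06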
TF-IDF Example in Python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print(X_tfidf.toarray()) # Outputs the TF-IDF matrix
Comparison of BoW and TF-IDF
Model | Description | Use Case
Bag of Words | Represents text as a collection of word frequencies. | Simple text classification tasks.
TF-IDF | Weighs words by frequency and importance in the corpus. | Better for tasks where word significance matters, like information retrieval.
2.6 Understanding the N-Gram Model
An N-Gram is a contiguous sequence of n items (words, characters, etc.) from a given text.
N-Grams capture local context by analyzing adjacent words or characters.
Types of N-Grams
N-Gram Type | Example
Unigram (n=1) | "I", "love", "NLP"
Bigram (n=2) | "I love", "love NLP"
Trigram (n=3) | "I love NLP", "love NLP courses"
Example: Generating Bigrams
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigram = bigram_vectorizer.fit_transform(corpus)
print(bigram_vectorizer.get_feature_names_out())  # Outputs list of bigrams
N-Gram Applications
● Unigrams: Often used in simple text classification tasks.
● Bigrams/Trigrams: Useful in language models where word context is important (e.g.,
machine translation, speech recognition).
Example N-Gram Usage:
Sentence: "I love NLP"
Unigrams: ["I", "love", "NLP"]
Bigrams: ["I love", "love NLP"]
2.7 Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA) is a technique in natural language processing that helps
discover the underlying structure of relationships between terms and documents. LSA reduces
the dimensionality of text data by transforming it into a lower-dimensional space using Singular
Value Decomposition (SVD). This technique is useful for text clustering, topic modeling, and
document similarity.
Steps in LSA:
1. Construct the Term-Document Matrix (using BoW or TF-IDF).
2. Apply Singular Value Decomposition (SVD) to decompose the matrix into three matrices:
U, Σ, and V.
3. Reduce the dimensionality by selecting the top k components from the decomposition.
Formula:
A = U Σ V^T
Where:
● A is the original matrix.
● U is the matrix representing terms.
● Σ is the diagonal matrix representing the singular values.
● V^T is the matrix representing documents.
Example: Applying LSA in Python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog barked at the mailman.",
          "The cat meowed at the dog.",
          "The mailman ran from the dog."]

# Convert corpus into a TF-IDF matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Perform SVD (LSA)
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)

# Output the LSA-reduced matrix
print(X_lsa)
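To interpret the reduced dimensions as topics, you can inspect which terms weigh most heavily on each SVD component; a minimal sketch continuing the code above (added for illustration):
import numpy as np

terms = tfidf.get_feature_names_out()
for i, component in enumerate(svd.components_):
    top_terms = [terms[j] for j in np.argsort(component)[::-1][:3]]
    print(f"Topic {i}: {top_terms}")  # Top 3 terms for each latent component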
LSA Applications:
● Topic Modeling: Identifying the underlying topics in a collection of documents.
● Information Retrieval: Improving search engine performance by finding documents with
similar meanings.
2.8 Word Synonyms and Antonyms using NLTK
In NLP, synonyms are words that have similar meanings, while antonyms are words with
opposite meanings. The NLTK (Natural Language Toolkit) provides a built-in lexical database
called WordNet to fetch synonyms and antonyms for any word.
Example: Finding Synonyms and Antonyms with NLTK
from nltk.corpus import wordnet

# Synonyms for "happy"
synonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

# Antonyms for "happy"
antonyms = []
for syn in wordnet.synsets("happy"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(f"Synonyms: {set(synonyms)}")
print(f"Antonyms: {set(antonyms)}")
Example Output:
Synonyms: {'felicitous', 'glad', 'happy'}
Antonyms: {'unhappy'}
Applications of Synonyms and Antonyms in NLP:
● Thesaurus generation.
● Word-sense disambiguation.
● Improving semantic search.
2.9 Word Negation Tracking
Word Negation Tracking refers to identifying and understanding negation in a sentence. Words
like "not", "never", "no", or "none" can drastically change the meaning of a sentence. Handling
negations is crucial for tasks like sentiment analysis or intent recognition.
Example: Negation Handling
import nltk
from nltk.tokenize import word_tokenize

def negate_sentence(sentence):
    tokens = word_tokenize(sentence)
    negation = False
    result = []
    for token in tokens:
        if token.lower() in ["not", "never", "no"]:
            negation = True          # start negating the tokens that follow
            result.append(token)
        elif token in [".", "!", "?"]:
            negation = False         # sentence-ending punctuation stops the negation scope
            result.append(token)
        else:
            result.append("NOT_" + token if negation else token)
    return " ".join(result)

# Example sentence
sentence = "I am not happy with the service."
negated_sentence = negate_sentence(sentence)
print(negated_sentence)  # Output: 'I am not NOT_happy NOT_with NOT_the NOT_service .'
Applications of Negation Tracking:
● Sentiment Analysis: Identifying positive and negative opinions more accurately.
● Intent Recognition: Understanding when users are making negative statements.
Unit 3: Text Classification and Text Summarization
3.1 Text Classification
Text classification is the process of assigning labels or categories to a piece of text based on its
content. This is widely used in tasks like sentiment analysis, spam detection, and topic
classification.
Steps in Text Classification:
1. Get the Data: Collect or import the dataset.
2. Data Preprocessing: Clean and preprocess the text (remove punctuation, stop words,
etc.).
3. Transform into BoW/TF-IDF Model: Convert the text into a vector representation.
4. Train the Model: Use classification algorithms like Logistic Regression, SVM, Naive
Bayes, etc.
5. Test the Model: Evaluate the model's performance using metrics like accuracy,
precision, recall.
Example: Text Classification with Logistic Regression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample data
corpus = ["I love this product!", "This is the worst experience!",
          "Absolutely fantastic service!", "Terrible customer support."]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Preprocessing: TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Train Logistic Regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
Common Text Classification Algorithms:
Algorithm | Description | Example Use Case
Logistic Regression | A simple linear model for binary classification. | Spam vs non-spam
Naive Bayes | A probabilistic classifier based on Bayes' theorem. | Sentiment analysis
Support Vector Machines | A robust linear classifier that finds the decision boundary. | Fake news detection
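Any of these classifiers can be swapped into the pipeline above; a minimal sketch using Multinomial Naive Bayes on the same TF-IDF features (an alternative added for illustration, not part of the original example):
from sklearn.naive_bayes import MultinomialNB

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)  # reuses the TF-IDF train/test split from above
print(f"Naive Bayes accuracy: {accuracy_score(y_test, nb_classifier.predict(X_test))}")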
3.2 Text Summarization
Text summarization is the process of reducing the length of a document while preserving its key
information. There are two main types of text summarization: Extractive Summarization and
Abstractive Summarization.
● Extractive Summarization: Selects key sentences or phrases directly from the original
text.
● Abstractive Summarization: Generates new sentences that summarize the content
(like how humans summarize text).
Steps in Extractive Summarization:
1. Fetch and Preprocess the Data: Get the document and clean it.
2. Tokenization: Split the text into sentences.
3. Build a Histogram: Calculate the frequency of each word.
4. Calculate Sentence Scores: Score each sentence based on the significance of its
words.
5. Select Sentences: Choose top N sentences for the summary.
Example: Extractive Summarization using NLTK
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict

# Sample text
text = """Natural Language Processing is an exciting field of Artificial Intelligence.
It enables machines to understand and process human language.
It is widely used in chatbots, language translation, and many other applications."""

# Step 1: Tokenize sentences
sentences = sent_tokenize(text)

# Step 2: Preprocess words and build word frequencies
stop_words = set(stopwords.words('english'))
word_frequencies = defaultdict(int)
for word in word_tokenize(text):
    if word.lower() not in stop_words and word.isalpha():
        word_frequencies[word.lower()] += 1

# Step 3: Calculate sentence scores
sentence_scores = {}
for sentence in sentences:
    for word in word_tokenize(sentence):
        if word.lower() in word_frequencies:
            if sentence not in sentence_scores:
                sentence_scores[sentence] = word_frequencies[word.lower()]
            else:
                sentence_scores[sentence] += word_frequencies[word.lower()]

# Step 4: Select top sentences for summary
summary = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:2]
print("Summary:", " ".join(summary))
Applications of Text Summarization:
● News Summarization: Condensing lengthy news articles.
● Research Papers: Generating brief abstracts for long papers.
● Automatic Meeting Notes: Summarizing meeting transcripts for key points.
Unit 4: Semantics and Sentiment Analysis
4.1 Introduction to Semantics and Sentiment Analysis
● Semantics in NLP deals with the meaning and interpretation of words, phrases,
sentences, and larger units of text. It helps understand context, disambiguate word
meanings, and identify relationships between entities.
● Sentiment Analysis focuses on determining the emotional tone behind a body of text,
identifying whether it is positive, negative, or neutral.
4.2 Semantics and Word Vectors
Word vectors (also known as word embeddings) are numerical representations of words in a
high-dimensional space. Words that share similar contexts in a corpus tend to be closer in this
vector space. Word vectors enable semantic analysis by capturing relationships like:
● Synonymy: Words with similar meanings.
● Analogy: Word relationships (e.g., "king" is to "queen" as "man" is to "woman").
Common Word Embedding Models:
● Word2Vec: Converts words into dense vectors by training on large corpora using two
architectures: Continuous Bag of Words (CBOW) and Skip-Gram.
● GloVe (Global Vectors for Word Representation): Generates word embeddings by
factorizing word co-occurrence matrices.
● FastText: Builds on Word2Vec, but models sub-word information, making it better for
handling rare words.
Word2Vec Example:
import gensim
from gensim.models import Word2Vec

# Example corpus
sentences = [["I", "love", "NLP"], ["NLP", "is", "fun"], ["I", "enjoy", "learning", "NLP"]]

# Train a Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get word vector for "NLP"
print(model.wv['NLP'])

# Finding most similar words to "NLP"
print(model.wv.most_similar('NLP'))
Example Output (Most Similar Words to 'NLP'):
[('love', 0.832), ('learning', 0.810), ('fun', 0.750)]
Analogy Example:
# Analogy: "man" is to "king" as "woman" is to ? (expected answer: "queen")
# Note: this requires a model trained on a large corpus; the toy corpus above does not contain these words.
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
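A minimal FastText sketch on the same toy corpus (gensim's FastText shares the Word2Vec interface; because it models character n-grams, it can build vectors even for words it has never seen):
from gensim.models import FastText

ft_model = FastText(sentences, vector_size=100, window=5, min_count=1)
print(ft_model.wv['NLP'])   # vector for a known word
print(ft_model.wv['NLPs'])  # vector assembled from sub-word n-grams, even though 'NLPs' was never seen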
4.3 Sentiment Analysis Overview
Sentiment Analysis is the task of analyzing a piece of text to determine the underlying
sentiment or opinion. It can classify text as positive, negative, or neutral. Sentiment analysis is
widely used in:
● Product reviews: Identifying customer opinions on products.
● Social media: Analyzing user feedback and reactions.
● Customer service: Gauging user satisfaction from responses.
Common Approaches for Sentiment Analysis:
Method | Description | Example
Rule-Based | Uses predefined lists of positive/negative words and rules | Lexicon-based (SentiWordNet)
Machine Learning | Classifies sentiment using machine learning algorithms | Logistic Regression, Naive Bayes
Deep Learning | Uses neural networks to automatically learn sentiment patterns | LSTM, RNNs, Transformers
Sentiment Score Example (Lexicon-Based):
Each word is assigned a sentiment score based on its polarity (positive or negative).
● "I love this product!" → Positive sentiment (words like "love" have positive polarity).
● "The service was terrible." → Negative sentiment (words like "terrible" have negative
polarity).
4.4 Sentiment Analysis with NLTK
The Natural Language Toolkit (NLTK) provides tools for simple sentiment analysis. The
VADER (Valence Aware Dictionary for Sentiment Reasoning) model, built into NLTK, is
commonly used for lexicon-based sentiment analysis.
Example: Sentiment Analysis using NLTK's VADER
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Requires the VADER lexicon: nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Example sentence
sentence = "I love this product, but the delivery was terrible."

# Calculate sentiment scores
sentiment_scores = analyzer.polarity_scores(sentence)
print(sentiment_scores)  # Outputs a dictionary of scores
Example Output:
{'neg': 0.344, 'neu': 0.493, 'pos': 0.163, 'compound': -0.1531}
● neg: Negative sentiment score
● neu: Neutral sentiment score
● pos: Positive sentiment score
● compound: Overall sentiment (ranges from -1 to +1, where -1 is very negative and +1 is
very positive)
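The compound score is usually mapped to a label with fixed cutoffs; a minimal sketch using the thresholds commonly recommended for VADER (±0.05; the exact cutoffs are a judgment call):
def label_sentiment(compound):
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment(sentiment_scores['compound']))  # 'negative' for the example sentence above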
4.5 Sentiment Analysis Movie Review Project
Objective: Classify movie reviews as positive or negative.
Steps for the Sentiment Analysis Project:
1. Data Collection: Use the IMDb movie review dataset, which contains reviews labeled as
positive or negative.
2. Preprocessing: Clean the text by removing stop words, punctuation, and performing
tokenization.
3. Feature Extraction: Convert text into numerical format using Bag of Words or TF-IDF.
4. Training the Model: Train a machine learning model such as Logistic Regression or
Naive Bayes.
5. Evaluating the Model: Evaluate the model using metrics such as accuracy, precision,
recall, and F1 score.
Example Project Pipeline:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample movie review data
data = pd.read_csv('IMDB_Dataset.csv')
X = data['review']
y = data['sentiment'].map({'positive': 1, 'negative': 0})

# Preprocessing: Vectorize text using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = tfidf.fit_transform(X)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train the Logistic Regression model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
4.6 Twitter Sentiment Analysis
Twitter Sentiment Analysis focuses on analyzing the sentiment of tweets in real-time. Since
tweets are brief and often contain informal language, they pose unique challenges for NLP. This
project involves fetching live tweets, processing them, and predicting their sentiment.
Steps for Twitter Sentiment Analysis:
1. Set up the Twitter Application: Create a Twitter developer account and get access
tokens and API keys.
2. Fetch Real-Time Tweets: Use the tweepy library to fetch tweets based on specific
hashtags or keywords.
3. Preprocessing the Tweets: Clean tweets by removing URLs, hashtags, mentions, and
special characters.
4. Predicting Sentiment: Load a pre-trained sentiment analysis model (e.g., TF-IDF and
Logistic Regression) to classify the sentiment of each tweet.
5. Visualizing Results: Plot the distribution of sentiments (positive, negative, neutral).
Example: Fetching Tweets with Tweepy and Sentiment Analysis
import tweepy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Set up Tweepy API with your credentials
auth = tweepy.OAuthHandler('API_KEY', 'API_SECRET_KEY')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth)

# Fetch tweets based on a keyword
# (in Tweepy 4.x this method is named api.search_tweets)
tweets = api.search(q="product review", lang="en", count=100)

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Analyze sentiment of each tweet
for tweet in tweets:
    text = tweet.text
    sentiment = analyzer.polarity_scores(text)
    print(f"Tweet: {text} | Sentiment: {sentiment['compound']}")
Applications of Twitter Sentiment Analysis:
● Brand monitoring: Understanding public sentiment about a product or service.
● Political sentiment analysis: Gauging public opinion on political issues.
● Market research: Analyzing customer feedback to improve products.
Visualizing Twitter Sentiment Results
Once the tweets are processed and classified, you can plot the sentiment distribution using
matplotlib or seaborn to visualize whether the majority of tweets are positive, negative, or
neutral.
Key Concepts Recap and Differences
Concept | Definition | Example | Tools/Techniques
Word Vectors | High-dimensional numeric representations of words. | Word2Vec, GloVe | Word2Vec, FastText
Sentiment Analysis | Classifies text based on emotional tone. | Positive or negative movie reviews | NLTK (VADER), Machine Learning (Logistic Regression)
Named Entity Recognition | Identifies and classifies proper nouns. | Recognizing people, locations, dates in a sentence | SpaCy, NLTK
Text Summarization | Condenses text into shorter, meaningful summaries. | Summarizing an article into 3-4 key sentences | TF-IDF, Word Frequencies
Unit 4: Semantics and Sentiment Analysis - Continued
4.7 Sentiment Analysis Movie Review Project (Expanded)
Objective: The goal of this project is to classify movie reviews as either positive or negative
using machine learning techniques. In this section, we'll break down the project pipeline into
detailed steps with relevant code, insights, and explanations.
Steps for Movie Review Sentiment Classification:
1. Data Collection:
○ We'll use the IMDb movie review dataset, a commonly used dataset for
sentiment analysis.
○ This dataset contains reviews labeled as positive or negative, which helps train
a classification model.
import pandas as pd
# Load IMDb dataset (CSV file with reviews and sentiment labels)
data = pd.read_csv('IMDB_Dataset.csv')
# Inspect the first few rows of the data
print(data.head())
2. Sample Data:
Review | Sentiment
"I loved this movie! The acting was amazing and the story was gripping." | Positive
"This was the worst movie I have ever seen. I regret watching it." | Negative
"An absolute masterpiece with brilliant performances by the entire cast." | Positive
"Terrible plot, bad acting, and a complete waste of time. Avoid this movie at all costs." | Negative
3. Preprocessing:
○ Before training the model, we need to clean the data:
■ Lowercasing: Convert all text to lowercase to ensure uniformity.
■ Removing Punctuation: Strip out punctuation marks that don’t carry
meaning.
■ Tokenization: Split text into individual words or tokens.
■ Stop Word Removal: Remove common words like "the", "is", "in", which
don’t contribute to sentiment.
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize

# Preprocess function to clean and tokenize text
def preprocess_text(text):
    # Remove punctuation and convert to lowercase
    text = re.sub(r'[^\w\s]', '', text.lower())
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to the dataset
data['cleaned_review'] = data['review'].apply(preprocess_text)
4. Example of Preprocessing:
   Original review: "I loved this movie! The acting was amazing and the story was gripping."
   Preprocessed: "loved movie acting amazing story gripping"
5. Vectorization (Converting Text to Numerical Form):
○ We’ll use the TF-IDF (Term Frequency-Inverse Document Frequency) model
to convert the text into a numerical format that machine learning algorithms can
work with.
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text data into TF-IDF vectors
tfidf = TfidfVectorizer(max_features=5000)  # Limit the vocabulary to 5000 words
X = tfidf.fit_transform(data['cleaned_review'])
y = data['sentiment'].map({'positive': 1, 'negative': 0})  # Map 'positive' to 1 and 'negative' to 0
6. Splitting the Data:
○ We divide the dataset into a training set (to train the model) and a test set (to
evaluate its performance).
from sklearn.model_selection import train_test_split

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
7. Training the Model:
○ We’ll use Logistic Regression as our classification model. Logistic regression is
well-suited for binary classification tasks.
from sklearn.linear_model import LogisticRegression

# Train the Logistic Regression classifier
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
8. Testing and Evaluating the Model:
○ After training, we’ll evaluate the model using metrics like accuracy, precision,
recall, and F1-score.
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
Output Example:
Accuracy: 0.87
              precision    recall  f1-score   support

    Negative       0.88      0.86      0.87       245
    Positive       0.86      0.88      0.87       255

    accuracy                           0.87       500
4.8 Twitter Sentiment Analysis (Expanded)
Objective: Analyze the sentiment of real-time tweets on a specific topic or hashtag using the
Twitter API and classify them as positive, negative, or neutral.
Steps for Twitter Sentiment Analysis:
1. Setting Up the Twitter API:
○ First, create a Twitter Developer Account and get access tokens and API keys.
○ Use the Tweepy library to authenticate and fetch tweets.
import tweepy
# Replace these with your own API keys from Twitter Developer account
API_KEY = 'your_api_key'
API_SECRET = 'your_api_secret'
ACCESS_TOKEN = 'your_access_token'
ACCESS_SECRET = 'your_access_secret'
# Authenticate using Tweepy
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)
2. Fetching Real-Time Tweets:
○ We can fetch tweets based on specific hashtags or keywords.
# Fetch tweets based on a hashtag
keyword = "#AI"
tweets = api.search(q=keyword, lang="en", count=100)  # api.search_tweets in Tweepy 4.x

# Print the text of the first 5 tweets
for tweet in tweets[:5]:
    print(tweet.text)
3. Preprocessing Tweets:
○ Just like the movie review dataset, we preprocess tweets to remove unnecessary
characters such as URLs, hashtags, mentions, and punctuation.
import re

def preprocess_tweet(text):
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"#\w+", "", text)     # Remove hashtags
    text = re.sub(r"@\w+", "", text)     # Remove mentions
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text.lower()

# Preprocess tweets
cleaned_tweets = [preprocess_tweet(tweet.text) for tweet in tweets]
4. Sentiment Analysis of Tweets:
○ We’ll use the VADER sentiment analyzer from NLTK to classify the sentiment of
each tweet as positive, negative, or neutral.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the VADER sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Analyze the sentiment of each tweet
for tweet in cleaned_tweets:
    sentiment = sid.polarity_scores(tweet)
    print(f"Tweet: {tweet} | Sentiment: {sentiment['compound']}")
5. Visualizing Sentiment Distribution:
○ You can use matplotlib to visualize the distribution of sentiments (positive,
negative, neutral) across the tweets.
import matplotlib.pyplot as plt

# Count the number of positive, neutral, and negative sentiments
sentiment_counts = {'positive': 0, 'neutral': 0, 'negative': 0}
for tweet in cleaned_tweets:
    sentiment = sid.polarity_scores(tweet)
    if sentiment['compound'] >= 0.05:
        sentiment_counts['positive'] += 1
    elif sentiment['compound'] <= -0.05:
        sentiment_counts['negative'] += 1
    else:
        sentiment_counts['neutral'] += 1

# Plot the sentiment distribution
labels = ['Positive', 'Neutral', 'Negative']
sizes = [sentiment_counts['positive'], sentiment_counts['neutral'], sentiment_counts['negative']]
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, colors=['green', 'gray', 'red'])
plt.axis('equal')  # Equal aspect ratio ensures that the pie chart is drawn as a circle.
plt.title(f"Sentiment Distribution for {keyword} Tweets")
plt.show()
6. Applications of Twitter Sentiment Analysis:
● Brand Monitoring: Tracking how users perceive a brand or product based on real-time
feedback on social media.
● Political Sentiment: Analyzing public opinion on political issues or candidates.