NLP - PPT - Module 3 - Naïve Bayes, Text Classification and Sentiment
• Test Document:
• "predictable with no fun."
How to Solve Step by Step
Step 1: Understand the Training Data
• Negative Reviews:
• "just plain boring"
• "entirely predictable and lacks energy"
• "no surprises and very few laughs"
• Positive Reviews:
• "very powerful"
• "the most fun film of the summer"
• Test Document:
• "predictable with no fun."
Step 2: Preprocess the Training Data
• Tokenize the Reviews:
• Negative Reviews:
• ["just", "plain", "boring"]
• ["entirely", "predictable", "and", "lacks", "energy"]
• ["no", "surprises", "and", "very", "few", "laughs"]
• Positive Reviews:
• ["very", "powerful"]
• ["the", "most", "fun", "film", "of", "the", "summer"]
• Build Vocabulary V:
• Union of all unique words in the training data:
• ["just", "plain", "boring", "entirely", "predictable", "and", "lacks", "energy", "no",
"surprises", "very", "few", "laughs", "powerful", "the", "most", "fun", "film", "of",
"summer"]
• Vocabulary size ∣V∣=20.
Step 3: Compute Class Priors P(c)
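• The training set has 3 negative and 2 positive reviews, so:
• P(negative) = 3/5 = 0.6
• P(positive) = 2/5 = 0.4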
Step 4: Compute Word Likelihoods P(wi∣c)
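• Assuming add-one (Laplace) smoothing, as in the spam example later in this module: P(wi∣c) = (count(wi, c) + 1) / (total tokens in c + ∣V∣).
• The negative reviews contain 3 + 5 + 6 = 14 tokens and the positive reviews 2 + 7 = 9 tokens, so the denominators are 14 + 20 = 34 (negative) and 9 + 20 = 29 (positive).
• For the words that survive in the test document: P(predictable∣neg) = 2/34, P(no∣neg) = 2/34, P(fun∣neg) = 1/34, and P(predictable∣pos) = 1/29, P(no∣pos) = 1/29, P(fun∣pos) = 2/29.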
Step 5: Preprocess the Test Document
• Test Document:
• "predictable with no fun."
• Remove Unknown Words:
• "with" is not in the vocabulary → remove it.
• Test Document After Removal:
• "predictable no fun."
Step 6: Calculate Class Probabilities
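• Using the priors and smoothed likelihoods above:
• P(neg) × P(predictable∣neg) × P(no∣neg) × P(fun∣neg) = 3/5 × 2/34 × 2/34 × 1/34 ≈ 6.1 × 10⁻⁵
• P(pos) × P(predictable∣pos) × P(no∣pos) × P(fun∣pos) = 2/5 × 1/29 × 1/29 × 2/29 ≈ 3.3 × 10⁻⁵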
Step 7: Make the Classification Decision
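• Since 6.1 × 10⁻⁵ > 3.3 × 10⁻⁵, the test document "predictable with no fun." is classified as negative.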
Handling Unknown Words
Tasks:
1.Identify the unknown words in the test document (if any) and remove
them.
2.Compute the class priors P(c) for spam and ham.
3.Compute the word likelihoods P(wi∣c) for the remaining words using
Laplace smoothing.
4.Calculate P(c∣d) for both classes and classify the test document.
Step 1: Preprocess the Training Data
Total Words in Each Class:
• Spam: 7 words.
• Ham: 5 words.
Step 2: Preprocess the Test Document
• Test Document:
• "free lunch and prize"
• Tokenize:
• ["free", "lunch", "and", "prize"]
• Identify Unknown Words:
• "and" is not in the vocabulary → remove it.
• Test Document After Removal:
• ["free", "lunch", "prize"]
Step 3: Compute Class Priors P(c)
Step 4: Compute Word Likelihoods P(wi∣c)
Step 5: Calculate Class Probabilities P(c∣d)
Step 6: Classification Decision
Worked Examples
Multinomial Naive Bayes Classifier
• The Multinomial Naive Bayes (MNB) classifier is a probabilistic
machine learning algorithm.
• Commonly used for text classification tasks, such as spam detection,
sentiment analysis, and document categorization.
• It is based on Bayes' Theorem and makes the "naive" assumption that
the features (words) are conditionally independent given the class.
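A minimal Python sketch of such a classifier (illustrative helper names; add-one smoothing and log-probabilities are assumed here, not prescribed by the slides):

import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (token_list, label) pairs
    vocab = {w for tokens, _ in docs for w in tokens}
    doc_counts = defaultdict(int)        # number of documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    for tokens, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    priors = {c: doc_counts[c] / len(docs) for c in doc_counts}
    likelihoods = {}
    for c in doc_counts:
        total = sum(word_counts[c].values())                         # total tokens in class c
        likelihoods[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab))  # add-one smoothing
                          for w in vocab}
    return priors, likelihoods, vocab

def classify(tokens, priors, likelihoods, vocab):
    tokens = [w for w in tokens if w in vocab]                       # drop unknown words
    scores = {c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in tokens)
              for c in priors}
    return max(scores, key=scores.get)

# The movie-review data from the worked example:
train = [("just plain boring".split(), "neg"),
         ("entirely predictable and lacks energy".split(), "neg"),
         ("no surprises and very few laughs".split(), "neg"),
         ("very powerful".split(), "pos"),
         ("the most fun film of the summer".split(), "pos")]
priors, likelihoods, vocab = train_nb(train)
classify("predictable with no fun".split(), priors, likelihoods, vocab)  # -> 'neg'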
Optimizing for Sentiment Analysis
• Sentiment analysis is the task of determining the sentiment
(positive, negative, or neutral) expressed in a piece of text,
such as a review, tweet, or comment.
• While the standard Multinomial Naive Bayes
(MNB) classifier works well for sentiment analysis, there
are some optimizations that can improve its performance.
• These optimizations include:
1.Binary Naive Bayes
2.Handling Negation
3.Using Sentiment Lexicons
Why Optimization is Required in Sentiment
Analysis
• While the standard Multinomial Naive Bayes (MNB) classifier
works well for many text classification tasks, sentiment analysis
has some unique challenges that require optimizations to
improve performance.
• E.g.:
• Document 1: “I like that movie”
• Document 2: “I didn’t like that movie”
• With plain word features, Naïve Bayes will treat both documents as positive
1. Binary Naive Bayes
What is Binary Naive Bayes?
• Binary Naive Bayes is a variant of the standard Multinomial Naive
Bayes classifier.
• Instead of using word frequencies (how many times a word appears
in a document), it focuses on word presence (whether a word
appears or not).
• This means that even if a word appears multiple times in a document,
it is treated as if it appeared only once.
Why Use Binary Naive Bayes?
• In sentiment analysis, the presence of a word (e.g., "love" or "hate")
is often more important than its frequency.
• For example, saying "I love this movie" once or multiple times still
conveys the same sentiment.
• By ignoring word frequency, binary Naive Bayes reduces noise and
improves performance for sentiment tasks.
How It Works:
• During training and testing, duplicate words in a document are
removed.
• Example:
• Original document: "great great film" → Binary representation: "great film".
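A small Python sketch of this binarization step (illustrative function name), keeping each word at most once per document while preserving order:

def binarize(tokens):
    seen = set()
    kept = []
    for w in tokens:
        if w not in seen:      # keep only the first occurrence of each word
            seen.add(w)
            kept.append(w)
    return kept

binarize("great great film".split())  # -> ['great', 'film']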
Example:
Training Data:
• Positive: "great film"
• Negative: "boring movie"
Test Document:
• "great great movie"
Binary Representation:
• "great movie"
Classification:
• The classifier predicts the sentiment based on the presence of "great" and
"movie" rather than their frequency.
2. Handling Negation
What is Negation?
• Negation words (e.g., "not", "didn't", "never") can flip the sentiment
of a sentence.
• For example:
• "I like this movie" (Positive)
• "I didn't like this movie" (Negative)
Why Handle Negation?
• Negation can completely change the meaning of a sentence, so
it’s important to capture its effect in sentiment analysis.
How to Handle Negation:
• A simple approach is to prepend "NOT_" to every word after a
negation token (e.g., "not", "didn't", "never") until the next
punctuation mark.
• Example:
• Original: "I didn't like this movie, but I enjoyed the acting."
• After negation handling: "I didn't NOT_like NOT_this NOT_movie, but I
enjoyed the acting."
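A minimal Python sketch of this rule (the negation list and the tokenizer below are simplified assumptions; a real system would use a fuller set of negation words):

import re

NEGATION_WORDS = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't", "can't"}

def mark_negation(text):
    out, negating = [], False
    # split into word tokens and punctuation marks
    for tok in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if tok in ".,!?;":
            negating = False              # punctuation ends the negation scope
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)      # prefix every word inside the negation scope
        else:
            out.append(tok)
            if tok in NEGATION_WORDS:
                negating = True
    return " ".join(out)

mark_negation("I didn't like this movie, but I enjoyed the acting.")
# -> "i didn't NOT_like NOT_this NOT_movie , but i enjoyed the acting ."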
Effect:
• Words like "NOT_like" and "NOT_movie" act as cues for negative
sentiment.
• Words like "NOT_bored" and "NOT_dismiss" act as cues for
positive sentiment.
Example:
• Training Data:
• Positive: "I enjoyed the movie"
• Negative: "I didn't enjoy the movie"
• Test Document:
• "I didn't like the acting"
• After Negation Handling:
• "I didn't NOT_like NOT_the NOT_acting"
• Classification:
• The classifier recognizes "NOT_like" and "NOT_acting" as negative cues.
3. Using Sentiment Lexicons
What are Sentiment Lexicons?
• Sentiment lexicons are pre-annotated lists of
words marked with positive or negative sentiment.
• Examples of popular lexicons:
• MPQA Subjectivity Lexicon: Contains 6,885 words marked as
strongly/weakly positive or negative.
• General Inquirer, LIWC, Hu and Liu Opinion Lexicon.
Why Use Sentiment Lexicons?
• When labeled training data is limited, sentiment lexicons
provide a reliable way to identify positive and negative
words.
• They help generalize better when the test data differs from
the training data.
How to Use Sentiment Lexicons:
• Add features like:
• "This word occurs in the positive lexicon."
• "This word occurs in the negative lexicon."
• Instead of counting individual words, count occurrences of
lexicon-based features.
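A toy Python sketch of such lexicon features (the word lists below are tiny illustrative stand-ins, not the contents of MPQA or any real lexicon):

POSITIVE_LEXICON = {"great", "powerful", "fun", "enjoyed"}
NEGATIVE_LEXICON = {"awful", "boring", "predictable", "hate"}

def lexicon_features(tokens):
    # count how many tokens fall in each lexicon instead of counting every individual word
    return {"positive_lexicon_hits": sum(w in POSITIVE_LEXICON for w in tokens),
            "negative_lexicon_hits": sum(w in NEGATIVE_LEXICON for w in tokens)}

lexicon_features("great acting".split())
# -> {'positive_lexicon_hits': 1, 'negative_lexicon_hits': 0}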
Example:
Training Data:
• Positive: "great film"
• Negative: "awful movie"
Lexicon Features:
• Positive Lexicon: "great" → Count for positive feature.
• Negative Lexicon: "awful" → Count for negative feature.
Test Document:
• "great acting"
Classification:
• The classifier uses the lexicon features to predict sentiment.
Summary of Optimizations
1.Binary Naive Bayes:
1. Focuses on word presence rather than frequency.
2. Reduces noise and improves performance for sentiment analysis.
2.Handling Negation:
1. Modifies words after negation tokens to capture sentiment changes.
2. Ensures that negation flips are properly represented.
3.Sentiment Lexicons:
1. Uses pre-annotated word lists to identify positive and negative words.
2. Provides robust features when training data is limited.
Why These Optimizations Matter
• Improved Accuracy:
• Binary Naive Bayes reduces noise from word frequency.
• Negation handling captures subtle sentiment changes.
• Sentiment lexicons provide reliable features for sparse data.
• Real-World Applications:
• Sentiment analysis for product reviews, social media, and customer
feedback.
Naive Bayes for Other Text Classification
Tasks
• Naive Bayes is a powerful and flexible algorithm that can be used for
many text classification tasks beyond sentiment analysis.
• In this section, we’ll explore how Naive Bayes can be adapted for two
important tasks:
• Spam detection and
• Language identification (language ID).
• We’ll also discuss how custom features can make the classifier more
effective for these tasks.
1. Spam Detection
• What is Spam Detection?
• Spam detection is about identifying unwanted emails (spam)
from legitimate ones (ham).
• It’s like teaching a computer to recognize junk mail so it can filter it
out of your inbox.
• Why Custom Features are Needed:
• Spam emails often use tricky language or specific patterns to
trick people.
• For example, spam emails might say things like “You’ve won
$1,000,000!” or “Click here for a free prize!”
• Using all words as features might not work well because spam
emails can use normal-sounding words to hide their true nature.
Custom Features for Spam Detection:
• Phrases and Patterns:
• Look for specific phrases like “one hundred percent guaranteed” or
“urgent reply.”
• Use regular expressions to match patterns like “mentions millions of
dollars” or “online pharmaceutical.”
• Non-Linguistic Features:
• Check if the email subject is in ALL CAPS.
• Look for suspicious HTML code, like unbalanced “head” tags.
• Analyze the email’s metadata (e.g., where it came from).
Example: SpamAssassin Features
• Phrases:
• “one hundred percent guaranteed”
• “urgent reply”
• Patterns:
• Matches large sums of money (e.g., “$1,000,000”).
• Non-Linguistic Features:
• Email subject line is all capital letters.
• HTML has unbalanced “head” tags.
• Claims you can be removed from the list.
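A few rules of this kind can be sketched as binary features in Python (illustrative patterns only, not the actual SpamAssassin rule set):

import re

def spam_features(subject, body):
    text = (subject + " " + body).lower()
    return {
        "all_caps_subject": subject.isupper(),
        "urgent_reply": "urgent reply" in text,
        "guaranteed_phrase": "one hundred percent guaranteed" in text,
        # matches large sums of money such as "$1,000,000"
        "mentions_millions": bool(re.search(r"\$\d{1,3}(,\d{3}){2,}", subject + " " + body)),
    }

spam_features("URGENT REPLY NEEDED!!!", "You have won $1,000,000! Click here to claim your prize.")
# -> {'all_caps_subject': True, 'urgent_reply': True, 'guaranteed_phrase': False, 'mentions_millions': True}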
How Naive Bayes Works for Spam
Detection:
• Training:
• The classifier learns from labeled emails (spam vs. ham) using custom
features like phrases, patterns, and non-linguistic features.
• Testing:
• For a new email, the classifier checks for these custom features.
• Classification:
• If the email has many spam-like features, it’s classified as spam.
Otherwise, it’s classified as ham.
2. Language Identification (Language ID)
• What is Language ID?
• Language ID is about figuring out what language a piece of text is
written in.
• For example, is the text in English, Spanish, or French?
• Why Custom Features are Needed:
• Words alone might not be enough because many languages share
common words (e.g., “the” in English and Dutch).
• Instead, we use character n-grams (small sequences of
characters) to capture language-specific patterns.
Custom Features for Language ID:
• Character n-grams:
• 2-grams: ‘th’, ‘er’, ‘in’
• 3-grams: ‘the’, ‘ing’, ‘and’
• 4-grams: ‘tion’, ‘ment’
• Byte n-grams:
• Treat text as a sequence of raw bytes (useful for handling different
character encodings).
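A one-line Python sketch of character n-gram extraction (byte n-grams work the same way on text.encode("utf-8")):

def char_ngrams(text, n):
    # all overlapping character sequences of length n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams("the cat", 2)  # -> ['th', 'he', 'e ', ' c', 'ca', 'at']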
Example: langid.py System
• Features:
• Uses all possible n-grams of lengths 1-4.
• Selects the most informative 7,000 features.
• Training Data:
• Multilingual text from sources like Wikipedia, Twitter, and Bible
translations.
• Includes regional dialects (e.g., Nigerian English, African American
Vernacular English).
How Naive Bayes Works for Language ID
• Training:
• The classifier learns from multilingual text using character or byte n-
grams as features.
• Testing:
• For a new text, the classifier checks for the presence of these n-grams.
• Classification:
• The text is classified as the language with the highest probability.
Why Custom Features Matter
1.Spam Detection:
1. Spam emails use tricky language and specific patterns, so custom features
like phrases and non-linguistic properties are necessary to catch them.
2.Language ID:
1. Words alone might not distinguish between languages, so character n-
grams are used to capture language-specific patterns.
Real-World Examples
• Spam Detection Example:
• Email:
• Subject: “URGENT REPLY NEEDED!!!”
• Body: “You have won $1,000,000! Click here to claim your prize.”
• Custom Features:
• Subject line is all capital letters.
• Contains the phrase “URGENT REPLY.”
• Mentions a large sum of money (“$1,000,000”).
• Classification:
• The classifier identifies these features and classifies the email as spam.
Language ID Example:
• Text:
• “El perro está en la casa.”
• Character n-grams:
• 2-grams: ‘El’, ‘ p’, ‘er’, ‘ro’, ‘es’, ‘tá’, ‘en’, ‘la’, ‘ca’, ‘as’, ‘a.’
• Classification:
• The classifier recognizes the n-grams as Spanish and classifies the text
as Spanish.
Interactive Activity
• Activity 1: Spam Detection
• Task:
• Look at the following email and identify spam-like features:
• Subject: “Congratulations! You’ve won a free iPhone!”
• Body: “Click here to claim your prize now!”
• Questions:
• What phrases or patterns make this email suspicious?
• Would you classify it as spam or ham?
Activity 2: Language ID
• Task:
• Look at the following text and guess the language:
• “Le chat est sur la table.”
• Questions:
• What character n-grams can you identify?
• What language do you think this is?
Summary
• Naive Bayes can be adapted for spam detection and language ID by
using custom features.
• For spam detection, features like phrases, patterns, and non-linguistic
properties are effective.
• For language ID, character or byte n-grams are used to capture
language-specific patterns.
• These custom features make the classifier more accurate and robust
for real-world applications.
Naive Bayes as a Language Model
• In this section, we’ll explore how Naive Bayes can be viewed as
a language model.
• Specifically, we’ll see how Naive Bayes, when using individual
word features, behaves like a set of class-specific unigram
language models.
• This means that for each class (e.g., positive or negative
sentiment), Naive Bayes creates a separate language model that
assigns probabilities to words and sentences.
1. Naive Bayes as a Language Model
• When Naive Bayes uses individual word features (and all words in the
text), it can be seen as a set of class-specific unigram language
models.
• A unigram language model assigns probabilities to individual words,
assuming each word is independent of the others.
• For each class (e.g., positive or negative), Naive Bayes creates a
separate unigram language model.
2. Sentence Probability
• The Naive Bayes model assigns a probability to a sentence by
multiplying the probabilities of the words in the sentence, given the
class.
• Formula:
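• P(sentence∣c) = P(w1∣c) × P(w2∣c) × … × P(wn∣c), i.e., the product of the class-conditional unigram probabilities of the words in the sentence.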