
NAIVE BAYES AND SENTIMENT CLASSIFICATION
CS6431 Natural Language Processing, Spring 2023
B1: Speech and Language Processing (Third Edition draft, January 2022)
Daniel Jurafsky, James H. Martin
Credits
1. B1
Assignment
Read:
B1: Chapter 4

Problems: Exercise problems of Chapter 4


Text Categorization
 Assigning a label/category to an entire sentence/document
 Sentiment analysis
 Assigning the positive or negative orientation that a writer expresses toward some object
 Book reviews, movie reviews, product reviews, etc.

 Spam detection
 Authorship attribution
 Subject category assignment
Supervised Learning Approach
 Input:
 A training set of labelled documents (𝑑1, 𝑐1), (𝑑2, 𝑐2), …, (𝑑𝑁, 𝑐𝑁)
 An unknown document 𝑑

 Output
 The class label for 𝑑
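For concreteness, a toy instance of this input/output contract in Python (the two training documents are borrowed from a later slide; the variable names and the test document are illustrative assumptions):

# Training set: (document, class) pairs
train_docs = [
    ("I really like this movie", "positive"),
    ("I didn't like this movie", "negative"),
]
# Unknown document d whose class label we want to predict
test_doc = "I really really like this film"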
Naive Bayes Classifiers
Bag-of-words

Position is ignored, only frequencies are used


Naïve Bayes
 Returns the class ĉ that has the maximum posterior probability given the document:
ĉ = argmax_{𝑐∈𝐶} 𝑃(𝑐|𝑑)

 Plugging Bayes' rule into the above:
ĉ = argmax_{𝑐∈𝐶} 𝑃(𝑑|𝑐) 𝑃(𝑐) / 𝑃(𝑑)

 Dropping the denominator (it is the same for every class):
ĉ = argmax_{𝑐∈𝐶} 𝑃(𝑑|𝑐) 𝑃(𝑐)

 Let document 𝑑 be represented as a set of features 𝑓1, 𝑓2, …, 𝑓𝑛
 Two simplifying assumptions
 Position of the word is not considered (does not matter)
 Naïve Bayes assumption: the features are conditionally independent given the class, so 𝑃(𝑓1, …, 𝑓𝑛 | 𝑐) = 𝑃(𝑓1|𝑐) · 𝑃(𝑓2|𝑐) · … · 𝑃(𝑓𝑛|𝑐)

 The final equation, with the word tokens 𝑤𝑖 as the features:
𝑐NB = argmax_{𝑐∈𝐶} 𝑃(𝑐) ∏_{𝑖 ∈ positions} 𝑃(𝑤𝑖|𝑐)

 The calculations are done in log space, to avoid underflow and increase speed; the rule becomes
𝑐NB = argmax_{𝑐∈𝐶} [ log 𝑃(𝑐) + Σ_{𝑖 ∈ positions} log 𝑃(𝑤𝑖|𝑐) ]
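A minimal Python sketch of this log-space decision rule; the logprior, loglikelihood and vocab structures are assumptions that stand in for whatever comes out of training, as described on the next slides:

import math

def classify(doc_tokens, logprior, loglikelihood, vocab, classes):
    """Pick the class maximising log P(c) + sum_i log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in classes:
        score = logprior[c]
        for w in doc_tokens:
            if w in vocab:                      # unknown words are simply ignored
                score += loglikelihood[(w, c)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class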
Training the Naive Bayes Classifier
 How to compute 𝑃(𝑐) and 𝑃(𝑤𝑖|𝑐)?
 𝑃(𝑐) = 𝑁𝑐 / 𝑁𝑑𝑜𝑐
 𝑁𝑐: number of documents labelled with 𝑐
 𝑁𝑑𝑜𝑐: total number of documents

 We’ll assume a feature is just the existence of a word in the document’s bag of words
 𝑃(𝑤𝑖|𝑐) = count(𝑤𝑖, 𝑐) / Σ_{𝑤∈𝑉} count(𝑤, 𝑐)
 count(𝑤𝑖, 𝑐): number of occurrences of 𝑤𝑖 in the documents of class 𝑐
 𝑐: topic/class label
 𝑉: vocabulary of the dataset
A problem
 Consider the problem of movie reviews
 Imagine no positive review in the training set contains “fantastic”, but the test set does
 Then 𝑃(“fantastic” | positive) = 0, so the whole product for the class “positive” will be zero

 Solution: Laplace (add-one) smoothing
𝑃(𝑤𝑖|𝑐) = (count(𝑤𝑖, 𝑐) + 1) / (Σ_{𝑤∈𝑉} count(𝑤, 𝑐) + |𝑉|)

 Note: vocabulary 𝑉 consists of the union of all the word types in all classes, not just the words in one class 𝑐 (why?)
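A compact training sketch with add-one smoothing, one plausible rendering of the formulas above (the function and variable names are my own, not from the slides):

import math
from collections import Counter

def train_nb(labelled_docs):
    """labelled_docs: list of (token_list, class) pairs."""
    classes = {c for _, c in labelled_docs}
    vocab = {w for tokens, _ in labelled_docs for w in tokens}
    logprior, loglikelihood = {}, {}
    for c in classes:
        docs_c = [tokens for tokens, label in labelled_docs if label == c]
        logprior[c] = math.log(len(docs_c) / len(labelled_docs))   # log P(c) = log(N_c / N_doc)
        counts = Counter(w for tokens in docs_c for w in tokens)   # count(w, c)
        denom = sum(counts.values()) + len(vocab)                  # add-one smoothing denominator
        for w in vocab:
            loglikelihood[(w, c)] = math.log((counts[w] + 1) / denom)
    return logprior, loglikelihood, vocab, classes

Together with the classify sketch earlier, this gives the full pipeline: estimate log priors and smoothed log likelihoods, then score a test document in log space.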
More things to remove
 Unknown Words: words in test data but not in training data
 Ignore them / remove them from test document/sentence
 Stop words removal
 Very frequent words like ‘the’ and ‘a’.
 Sort by frequency and take the top 10–100 entries as stop words
 Or, use a pre-defined list (see the sketch below)


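A small preprocessing sketch for these two steps (the stop-word list here is only an illustrative placeholder, not from the slides):

STOP_WORDS = {"the", "a", "an", "and", "of", "to"}   # illustrative; use a fuller list in practice

def preprocess(tokens, vocab, remove_stop_words=True):
    """Drop unknown words (not in the training vocabulary) and, optionally, stop words."""
    kept = [w for w in tokens if w in vocab]
    if remove_stop_words:
        kept = [w for w in kept if w not in STOP_WORDS]
    return kept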
Improvements
 For a text classification task, whether a word occurs or not seems to matter more than its frequency
 Clip the word counts in each document at 1
 Binary Naïve Bayes
◼ Note: counts can still be greater than 1 in Binary NB, because the clipped per-document counts are summed over all documents of a class
 Dealing with negation
◼ I really like this movie (+ve)
◼ I didn’t like this movie (-ve)
◼ Negation can flip the sentiment of the words that follow it; a simple baseline is to prepend NOT_ to every word between a negation token and the next punctuation mark (see the sketch after this slide)
 Insufficient training data
◼ Leads to inaccurate training of Naïve Bayes
◼ Instead, derive features using sentiment lexicons
◼ Lists of words that are pre-annotated with positive or negative sentiment
◼ Add one feature for +ve words and one for –ve words
◼ Count of the +ve/–ve feature = count of words from the corresponding lexicon
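A sketch of the two tricks above: per-document count clipping for Binary NB and NOT_ marking for negation (the negator and punctuation sets are illustrative assumptions, not from the slides):

from collections import Counter

PUNCT = {".", ",", "!", "?", ";", ":"}
NEGATORS = {"not", "no", "never", "didn't", "don't", "isn't"}   # illustrative subset

def binarize(tokens):
    """Binary NB: each word counts at most once per document."""
    return Counter(set(tokens))

def mark_negation(tokens):
    """Prepend NOT_ to every token between a negation word and the next punctuation."""
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCT:
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATORS:
                negating = True
    return out

For example, mark_negation(["I", "didn't", "like", "this", "movie"]) yields ["I", "didn't", "NOT_like", "NOT_this", "NOT_movie"], so NOT_like becomes a feature distinct from like.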
Naive Bayes as a Language Model
 With individual words as features, Naive Bayes assigns a probability 𝑃(𝑤|𝑐) to each word and hence to whole sentences, so it can be viewed as a set of class-specific unigram language models
 e.g., 𝑃(“I love this fun film” | +) = 𝑃(I|+) · 𝑃(love|+) · 𝑃(this|+) · 𝑃(fun|+) · 𝑃(film|+)
Evaluation Metrics
Confusion Matrix
 Precision 𝑃 = TP / (TP + FP), Recall 𝑅 = TP / (TP + FN)
 Precision and Recall alone are not sufficient (why?)
F-measure
 𝐹𝛽 = (𝛽² + 1) 𝑃𝑅 / (𝛽²𝑃 + 𝑅)
 𝛽 > 1 favors recall
 𝛽 < 1 favors precision
 𝛽 = 1 gives equal importance to precision and recall
 𝐹𝛽=1 or 𝐹1 = 2𝑃𝑅 / (𝑃 + 𝑅)
 The harmonic mean is more conservative than the arithmetic mean
◼ Closer to the smaller of the two numbers
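A small sketch computing these metrics from raw confusion-matrix counts (binary case; the function name is my own):

def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_beta = (b2 + 1) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta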
Evaluating more than two classes
 Combine per-class scores by macroaveraging (average precision/recall/F over the classes) or microaveraging (pool all decisions into a single confusion matrix first)
𝑘-fold Cross Validation
 Split the data into 𝑘 folds; train on 𝑘 − 1 folds, test on the held-out fold, rotate through all 𝑘 folds, and average the results (see the sketch below)
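A minimal cross-validation sketch, assuming the caller supplies whatever training-and-scoring function is being evaluated (none of these names come from the slides):

def cross_validate(labelled_docs, train_and_score, k=10):
    """Average a score over k train/test splits.

    train_and_score(train, test) is any function that trains a classifier on
    `train` and returns its score (e.g. accuracy or F1) on `test`.
    """
    folds = [labelled_docs[i::k] for i in range(k)]   # simple round-robin split
    scores = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k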
Statistical Significance Testing
 How to decide if model/classifier 𝐴 is better than 𝐵?
 𝑀(𝐴, 𝑥): performance of model/classifier 𝐴 on test set 𝑥
 𝑀(𝐵, 𝑥): performance of model/classifier 𝐵 on test set 𝑥
 𝛿(𝑥) (effect size) = 𝑀(𝐴, 𝑥) − 𝑀(𝐵, 𝑥)
 Consider 𝛿(𝑥) = .04
 We want to check if 𝐴’s superiority over 𝐵 is likely to hold again if we checked another test set 𝑥′
 We define two hypotheses
 𝐻0: 𝛿(𝑥) ≤ 0 (Null Hypothesis, 𝐴 is not better than 𝐵)
 𝐻1: 𝛿(𝑥) > 0 (𝐴 is better than 𝐵)
 We want to test if we can confidently rule out the null hypothesis and instead support 𝐻1, i.e., 𝐴 is better
 Let 𝑋 be a random variable ranging over all test sets

 p-value: the probability, assuming the null hypothesis 𝐻0 is true, of seeing the 𝛿(𝑥) that we saw or one even greater
 If 𝐻0 is indeed true
◼ Large 𝛿(𝑥): highly surprising, p-value should be low, reject null hypothesis
◼ Small (+ve) 𝛿(𝑥): less surprising even if 𝐻0 is true, p-value should be high
 Threshold (like .01)
◼ p-value < .01, reject null hypothesis
 We say that a result (e.g., “𝐴 is better than 𝐵”) is statistically significant if the 𝛿 we saw has a probability below the threshold, and we therefore reject the null hypothesis
 How to estimate p-values?
 Create multiple test sets and measure 𝛿 on each
 Use a threshold to accept/reject the null hypothesis
The Paired Bootstrap Test
 Bootstrapping: repeatedly drawing a large number of smaller samples with replacement
 Create 𝑏 bootstrapped test sets 𝑥(1), …, 𝑥(𝑏) by sampling 𝑛 examples with replacement from the original test set 𝑥, and compute 𝛿(𝑥(𝑖)) on each
 𝟙(𝑥) = 1 if 𝑥 is true, and 0 otherwise
[Figure: distribution of 𝛿 values over the bootstrapped test sets]
 Goal: assume 𝐻0 and estimate how accidental/surprising 𝛿(𝑥) is
 Since the above distribution is centered on the observed 𝛿(𝑥) = .2 rather than on 0, to capture how surprising 𝛿(𝑥) is we compute:
p-value(𝑥) = (1/𝑏) Σ_{𝑖=1}^{𝑏} 𝟙( 𝛿(𝑥(𝑖)) − 𝛿(𝑥) ≥ 𝛿(𝑥) )
 Suppose
 10,000 bootstrapped test sets 𝑥(𝑖) are created
 The threshold is .01
 The formula above gives a p-value of .0047 (< threshold)
◼ Thus, we reject the null hypothesis
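A sketch of the paired bootstrap test under the assumptions above; a_correct and b_correct are hypothetical per-example 0/1 correctness lists for classifiers A and B on the same test set, and accuracy is used as the metric M:

import random

def paired_bootstrap_pvalue(a_correct, b_correct, b_samples=10_000):
    """Estimate the p-value for 'A is better than B' from paired per-example results."""
    n = len(a_correct)
    delta_x = (sum(a_correct) - sum(b_correct)) / n      # observed effect size delta(x)
    exceed = 0
    for _ in range(b_samples):
        idx = [random.randrange(n) for _ in range(n)]    # sample n examples with replacement
        delta_i = sum(a_correct[j] - b_correct[j] for j in idx) / n
        if delta_i - delta_x >= delta_x:                 # i.e. delta(x^(i)) >= 2 * delta(x)
            exceed += 1
    return exceed / b_samples

If the returned value is below the chosen threshold (e.g. .01), the null hypothesis is rejected.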
