
Multimedia Application

By
Minhaz Uddin Ahmed, PhD
Department of Computer Engineering
Inha University Tashkent
Email: [email protected]
Content
 Naive Bayes Classifiers
 Training the Naive Bayes Classifier
 Worked example
 Optimizing for Sentiment Analysis
 Naive Bayes for other text classification tasks
 Naive Bayes as a Language Model
Bayes theorem

 Bayes' theorem (also known as Bayes' Rule or Bayes' law) is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability:

    P(A|B) = (P(B|A) * P(A)) / P(B)

 P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
 P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
 P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
 P(B) is the Marginal probability: the probability of the evidence.
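
As a quick illustration (my own, not from the slides), a minimal Python sketch of applying the rule with made-up but consistent numbers:

```python
# Hypothetical numbers: 20% of emails are spam, the word "prize" appears in
# 40% of spam emails and in 10% of all emails.
p_word_given_spam = 0.40   # P(B|A), likelihood
p_spam = 0.20              # P(A), prior
p_word = 0.10              # P(B), marginal probability of the evidence

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)   # 0.8 -> an email containing "prize" is spam with probability 0.8
```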
Types of Naive Bayes:

 There are three types of Naive Bayes model under the scikit-learn library (a short usage sketch follows below):
• Gaussian: Used for classification when the features are continuous and assumed to follow a normal distribution.
• Multinomial: Used for discrete counts, e.g. in a text classification problem. Instead of the Bernoulli question "did the word occur in the document?", we count how often the word occurs in the document; you can think of it as the "number of times outcome x_i is observed over the n trials".
• Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application is text classification with the 'bag of words' model, where the 1s and 0s are "word occurs in the document" and "word does not occur in the document" respectively.
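
A minimal sketch (assuming scikit-learn is installed; the tiny spam/ham corpus is hypothetical) showing the three variants side by side:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

docs = ["win money now", "meeting at noon", "win a free prize", "lunch at noon"]
labels = ["spam", "ham", "spam", "ham"]      # hypothetical labels

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # word-count features
X_test = vec.transform(["free money now"])

# Multinomial NB works directly on the word counts.
print(MultinomialNB().fit(X, labels).predict(X_test))

# Bernoulli NB binarizes the counts internally (word present / absent).
print(BernoulliNB().fit(X, labels).predict(X_test))

# Gaussian NB expects dense, continuous-valued features.
print(GaussianNB().fit(X.toarray(), labels).predict(X_test.toarray()))
```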
Example
Example solution

 Solution:
 P(A|B) = (P(B|A) * P(A)) / P(B)
1. Mango:
 P(X | Mango) = P(Yellow | Mango) * P(Sweet | Mango) * P(Long | Mango)
a) P(Yellow | Mango) = (P(Mango | Yellow) * P(Yellow)) / P(Mango)
    = ((350/800) * (800/1200)) / (650/1200)
 P(Yellow | Mango) = 0.53
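
A one-line check of the arithmetic above in plain Python (no extra data assumed):

```python
# P(Yellow | Mango) = P(Mango | Yellow) * P(Yellow) / P(Mango)
print((350 / 800) * (800 / 1200) / (650 / 1200))   # ~0.538, reported as 0.53 on the slide
```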
Text Classification

 Assigning subject categories, topics, or genres


 Spam detection
 Authorship identification
 Age/gender identification
 Language Identification
 Sentiment analysis
Who wrote which Federalist papers?

 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
 Authorship of 12 of the letters is in dispute
 1963: solved by Mosteller and Wallace using Bayesian methods

[Portraits: James Madison, Alexander Hamilton]


Male or female author from a given text
By 1925 present-day Vietnam was divided into three parts under French
colonial rule. The southern region embracing Saigon and the Mekong
delta was the colony of Cochin-China; the central area with its imperial
capital at Hue was the protectorate of Annam …

Clara never failed to be astonished by the extraordinary felicity of her


own name. She found it hard to trust herself to the mercy of fate, which
had managed over the years to convert her greatest shame into one of
her greatest assets…
Text Classification: definition

 Input:
  a document d
  a fixed set of classes C = {c1, c2, …, cJ}

 Output: a predicted class c ∈ C


Classification Methods:
Hand-coded rules
 Rules based on combinations of words or other features
  spam: black-list-address OR ("dollars" AND "have been selected")
 Accuracy can be high
  If rules are carefully refined by an expert
 But building and maintaining these rules is expensive (a toy rule is sketched below)
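
A toy Python version of the rule above (illustrative only; the blacklist address and messages are hypothetical):

```python
BLACKLIST = {"[email protected]"}   # hypothetical blacklisted sender

def is_spam(sender: str, body: str) -> bool:
    # spam: black-list-address OR ("dollars" AND "have been selected")
    return sender in BLACKLIST or ("dollars" in body and "have been selected" in body)

print(is_spam("[email protected]", "You have been selected to receive dollars!"))  # True
print(is_spam("[email protected]", "Lunch tomorrow?"))                           # False
```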
Classification Methods:
Supervised Machine Learning
 Input:
 a document d

 a fixed set of classes C = {c1, c2,…, cJ}


 A training set of m hand-labeled documents (d1,c1), ..., (dm,cm)

 Output:
  a learned classifier γ: d → c
Classification Methods:
Supervised Machine Learning
 Any kind of classifier
 Naïve Bayes

 Logistic regression

 Support-vector machines

 k-Nearest Neighbors
Naive Bayes Intuition

 Simple ("naive") classification method based on Bayes rule


 Relies on very simple representation of document
 Bag of words
The Bag of Words Representation

 We preprocess the dataset by converting each email into a


bag-of-words representation, where each word is a feature and
its frequency in the email is its value. We also assign a label
(spam or not spam) to each email
The Bag of Words Representation

[Figure: a review document is reduced to an unordered bag of word counts, e.g. seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ..., which the classifier γ maps to a class: γ(d) = c]
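
A minimal sketch of building such a bag-of-words count representation in Python (the example sentence is made up):

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Lowercase and split on whitespace; real systems would use a proper tokenizer.
    return Counter(text.lower().split())

print(bag_of_words("I have seen it and seen it again, a sweet whimsical happy film I recommend"))
# Counter({'i': 2, 'seen': 2, 'it': 2, 'have': 1, 'and': 1, ...})
```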
Training

 We train the Naïve Bayes classifier on the labeled dataset. During


training, the classifier calculates the probabilities of each word
occurring in spam and not spam emails, as well as the prior
probabilities of spam and not spam emails in the dataset.
Prediction

Step 1: Given a new email, we convert it into a bag-of-words


representation.
Step 2: For each word in the email, we calculate its
conditional probability of occurring in spam and not spam
emails based on the probabilities learned during training.
Step 3: We multiply the conditional probabilities of all words
in the email and multiply them by the prior probabilities of
spam and not spam emails.
Step 4: We compare the calculated probabilities for spam
and not spam, and classify the email as spam or not spam
based on the higher probability.
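
A minimal Python sketch of these four steps, assuming the per-class priors and word probabilities (priors, word_probs) were produced by training as described above; log probabilities are used to avoid numerical underflow:

```python
import math
from collections import Counter

def predict(email: str, priors: dict, word_probs: dict) -> str:
    # Step 1: convert the new email into a bag-of-words representation.
    counts = Counter(email.lower().split())
    scores = {}
    for c in priors:                       # e.g. c is "spam" or "not_spam"
        # Steps 2-3: multiply the prior by the conditional probability of each
        # word (done in log space, so the products become sums).
        score = math.log(priors[c])
        for word, n in counts.items():
            if word in word_probs[c]:      # unknown words are simply skipped
                score += n * math.log(word_probs[c][word])
        scores[c] = score
    # Step 4: classify as the class with the higher probability.
    return max(scores, key=scores.get)
```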
Bayes' Rule Applied to Documents and Classes
 For a document d and a class c:

    P(c|d) = (P(d|c) * P(c)) / P(d)
Naive Bayes Classifier (I)

MAP is "maximum a posteriori" = most likely class:

    c_MAP = argmax_{c ∈ C} P(c|d)

Applying Bayes' Rule:

    c_MAP = argmax_{c ∈ C} (P(d|c) * P(c)) / P(d)

Dropping the denominator (P(d) is the same for every class):

    c_MAP = argmax_{c ∈ C} P(d|c) * P(c)
Text Classification and Naïve Bayes
Learning the Multinomial Naive Bayes Model

 First attempt: maximum likelihood estimates


 simply use the frequencies in the data

    P̂(c_j) = N_{c_j} / N_total
Parameter estimation

    P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j)

 i.e. the fraction of times word w_i appears among all words in documents of topic c_j

 Create a mega-document for topic j by concatenating all docs in this topic
 Use the frequency of w_i in the mega-document (see the sketch below)
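
A minimal sketch of this mega-document estimate in Python (the two training documents are made up):

```python
from collections import Counter

def mle_word_likelihoods(docs_in_class):
    # Mega-document: concatenate all documents of the class, then count words.
    mega_doc = " ".join(docs_in_class).lower().split()
    counts = Counter(mega_doc)
    total = sum(counts.values())
    # Maximum likelihood estimate: P(w | c) = count(w, c) / total words in class c.
    return {w: n / total for w, n in counts.items()}

print(mle_word_likelihoods(["a fantastic fun film", "fantastic acting"]))
# {'a': 1/6, 'fantastic': 2/6, 'fun': 1/6, 'film': 1/6, 'acting': 1/6}
```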
Problem with Maximum Likelihood

 What if we have seen no training documents with the word fantastic that are classified in the topic positive (thumbs-up)?

    P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0

 Zero probabilities cannot be conditioned away, no matter the other evidence!
Laplace (add-1) smoothing for Naïve Bayes

    P̂(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
Multinomial Naïve Bayes: Learning

 From the training corpus, extract the Vocabulary
 Calculate the P(c_j) terms:
  For each c_j in C do
   docs_j ← all docs with class = c_j
   P(c_j) ← |docs_j| / |total # of documents|
 Calculate the P(w_k | c_j) terms:
  Text_j ← single doc containing all of docs_j
  For each word w_k in Vocabulary
   n_k ← # of occurrences of w_k in Text_j
   P(w_k | c_j) ← (n_k + 1) / (n + |Vocabulary|), with add-1 smoothing, where n is the total number of tokens in Text_j

(A Python sketch of this loop follows below.)
Unknown words

 What about unknown words


 that appear in our test data
 but not in our training data or vocabulary?
 We ignore them
 Remove them from the test document!
 Pretend they weren't there!
 Don't include any probability for them at all!
 Why don't we build an unknown word model?
 It doesn't help: knowing which class has more unknown words is
not generally helpful!
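
In code, ignoring unknown words at test time is just a filter before scoring (a tiny sketch; the vocabulary and test sentence are hypothetical):

```python
vocabulary = {"the", "film", "was", "fantastic", "and"}          # learned from training
test_words = "the film was fantastic and zygomorphic".split()    # "zygomorphic" is unseen
known_words = [w for w in test_words if w in vocabulary]         # unknown word is dropped
print(known_words)   # ['the', 'film', 'was', 'fantastic', 'and']
```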
Stop words

 Some systems ignore stop words


 Stop words: very frequent words like the and a.
 Sort the vocabulary by word frequency in training set
 Call the top 10 or 50 words the stopword list.
 Remove all stop words from both training and test sets, as if they were never there!
 But removing stop words doesn't usually help
• So in practice most NB algorithms use all words and don't
use stopword lists
Naive Bayes: Learning

Sentiment Example:
A worked sentiment example with add-1 smoothing

1. Prior from training:
    P̂(c_j) = N_{c_j} / N_total
    P(-) = 3/5
    P(+) = 2/5
2. Drop "with"
3. Likelihoods from training:
    P̂(w_i | c) = (count(w_i, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
4. Scoring the test set:
Optimizing for sentiment analysis

 For tasks like sentiment, word occurrence seems to be more important than word frequency.
  The occurrence of the word fantastic tells us a lot
  The fact that it occurs 5 times may not tell us much more.

 Binary multinomial naive Bayes, or binary NB
  Clip our word counts at 1
  Note: this is different than Bernoulli naive Bayes; see the textbook at the end of the chapter.
Binary Multinomial Naïve Bayes: Learning

 From the training corpus, extract the Vocabulary
 Calculate the P(c_j) terms:
  For each c_j in C do
   docs_j ← all docs with class = c_j
 Calculate the P(w_k | c_j) terms:
  Remove duplicates in each doc: for each word type w in doc_j, retain only a single instance of w
  Text_j ← single doc containing all of docs_j
  For each word w_k in Vocabulary
   n_k ← # of occurrences of w_k in Text_j
Binary Multinomial Naive Bayes
on a test document d
First remove all duplicate words from d
Then compute NB using the same equation
Binary multinomial naive Bayes

Counts can still be 2! Binarization is within-doc!
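
A minimal sketch of within-document binarization (clipping each document's counts at 1) in Python, using made-up documents:

```python
from collections import Counter

def binarize(doc: str):
    # Keep one instance of each word type per document (clip counts at 1).
    return sorted(set(doc.lower().split()))

docs = ["it was great great great", "it was great it really was"]   # hypothetical
binarized = [" ".join(binarize(d)) for d in docs]
print(binarized)
# The binarized docs are then concatenated per class as before, so a word's
# count across documents can still be 2 even though it is at most 1 per doc.
print(Counter(" ".join(binarized).split())["great"])   # 2
```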


More on Sentiment Classification

 I really like this movie

I really don't like this movie

Negation changes the meaning of "like" to negative.


Negation can also change negative to positive-ish
◦ Don't dismiss this film
◦ Doesn't let us get bored
Sentiment Classification: Lexicons

Sometimes we don't have enough labeled training data


In that case, we can make use of pre-built word lists
Called lexicons
There are various publicly available lexicons
MPQA Subjectivity Cues Lexicon

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003.

 Home page: https://mpqa.cs.pitt.edu/lexicons/subj_lexicon/


 6885 words from 8221 lemmas, annotated for intensity (strong/weak)
 2718 positive
 4912 negative
 + : admirable, beautiful, confident, dazzling, ecstatic, favor, glee, great
 − : awful, bad, bias, catastrophe, cheat, deny, envious, foul, harsh,
hate
Using Lexicons in Sentiment
Classification
Add a feature that gets a count whenever a word from the lexicon
occurs
 E.g., a feature called "this word occurs in the positive lexicon" or
"this word occurs in the negative lexicon"
Now all positive words (good, great, beautiful, wonderful) or negative
words count for that feature.
Using 1-2 features isn't as good as using all the words.
• But when training data is sparse or not representative of the test set,
dense lexicon features can help
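
A minimal sketch of such dense lexicon-count features (the word sets below are tiny, hypothetical stand-ins for a real lexicon such as MPQA):

```python
POSITIVE = {"good", "great", "beautiful", "wonderful", "admirable"}   # toy positive lexicon
NEGATIVE = {"awful", "bad", "harsh", "hate", "catastrophe"}           # toy negative lexicon

def lexicon_features(doc: str) -> dict:
    words = doc.lower().split()
    # Two dense features (counts of lexicon hits) instead of one feature per word.
    return {"pos_lexicon_count": sum(w in POSITIVE for w in words),
            "neg_lexicon_count": sum(w in NEGATIVE for w in words)}

print(lexicon_features("a beautiful film with a harsh ending"))
# {'pos_lexicon_count': 1, 'neg_lexicon_count': 1}
```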
Naive Bayes in Other tasks: Spam
Filtering
 SpamAssassin features:
  Mentions millions of dollars (amounts of the form $NN,NNN,NNN.NN)
  From: starts with many numbers
  Subject is all capitals
  HTML has a low ratio of text to image area
  "One hundred percent guaranteed"
  Claims you can be removed from the list
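
A rough Python sketch of a few of these features as simple checks (my own approximations; not the actual SpamAssassin rules):

```python
import re

def spam_features(sender: str, subject: str, body: str) -> dict:
    return {
        # Mentions millions of dollars, e.g. "$42,000,000.00"
        "millions_of_dollars": bool(re.search(r"\$\d{1,3}(,\d{3}){2,}(\.\d{2})?", body)),
        # From: starts with many numbers
        "numeric_sender": bool(re.match(r"\d{4,}", sender)),
        # Subject is all capitals
        "all_caps_subject": subject.isupper(),
        # "One hundred percent guaranteed"
        "guaranteed_phrase": "one hundred percent guaranteed" in body.lower(),
    }

print(spam_features("12345@mail.example", "YOU HAVE WON",
                    "Claim your $10,000,000.00 prize, one hundred percent guaranteed"))
```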
Naive Bayes in Language ID

 Determining what language a piece of text is


written in.
Features based on character n-grams do very
well
 Important to train on lots of varieties of each
language
(e.g., American English varieties like African-American English,
or English varieties around the world like Indian English)
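
A minimal sketch of extracting character n-gram count features for language ID (feature extraction only; the example sentences are made up):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    # Pad with spaces so that word-boundary n-grams are captured too.
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

print(char_ngrams("this is a short English sentence").most_common(5))
print(char_ngrams("ceci est une phrase en français").most_common(5))
```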
Summary: Naive Bayes is Not So Naive
 Very fast, low storage requirements
 Works well with very small amounts of training data
 Robust to irrelevant features
  Irrelevant features cancel each other out without affecting results
 Very good in domains with many equally important features
  Decision trees suffer from fragmentation in such cases, especially with little data
 Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
 A good, dependable baseline for text classification
Reference

Chapter 4
Question
Thank you
