ml3 - Text Classification - Naive Bayes
P(c | x) = P(x | c) P(c) / P(x)
Bayes Classifiers for Categorical Data
Task: Classify a new instance x based on a tuple of attribute
values x = ⟨x1, x2, …, xn⟩ into one of the classes cj ∈ C
P(cj)
Can be estimated from the frequency of classes in the
training examples.
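In symbols (the standard relative-frequency estimate, added here for clarity): P̂(cj) = N(C = cj) / N, where N is the total number of training examples.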
P(x1,x2,…,xn|cj)
O(|X|^n · |C|) parameters
Could only be estimated if a very, very large number of
training examples was available.
Need to make some sort of independence
assumptions about the features to make learning
tractable.
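For example (an illustrative count, not from the slide): with n = 5 binary attributes and 2 classes, the full joint P(x1, …, x5 | cj) has 2 · (2^5 − 1) = 62 free parameters, while the Naïve Bayes factorization below needs only 2 · 5 = 10.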
The Naïve Bayes Classifier
Flu example: class C = Flu, features X1, …, X5 = runnynose, sinus, cough, fever, muscle-ache.
P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)
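A minimal sketch of how this factorization is used to classify (all probabilities below are made-up numbers, purely for illustration):

# Naive Bayes for the flu example: P(c | x) ∝ P(c) · Π_i P(x_i | c)
# All probabilities are hypothetical, for illustration only.
prior = {"flu": 0.1, "no_flu": 0.9}
cond = {   # P(symptom present | class), one entry per feature X1..X5
    "flu":    {"runnynose": 0.9, "sinus": 0.8, "cough": 0.7, "fever": 0.6, "muscle-ache": 0.5},
    "no_flu": {"runnynose": 0.2, "sinus": 0.1, "cough": 0.2, "fever": 0.05, "muscle-ache": 0.1},
}

def score(c, observed):
    """Unnormalized P(c | x): class prior times the product of per-feature conditionals."""
    s = prior[c]
    for feat, present in observed.items():
        p = cond[c][feat]
        s *= p if present else (1.0 - p)
    return s

x = {"runnynose": True, "sinus": True, "cough": True, "fever": False, "muscle-ache": True}
prediction = max(prior, key=lambda c: score(c, x))   # -> "flu" with these made-up numbers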
What if we have seen no training cases where the patient had muscle aches (X5 = t) but no flu (C = nf)?
P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0
Zero probabilities cannot be conditioned away, no matter the
other evidence!
Fix: smooth the estimates with an m-estimate:
P̂(xi,k | cj) = ( N(Xi = xi,k, C = cj) + m · pi,k ) / ( N(C = cj) + m )
where pi,k is a prior estimate for P(xi,k | cj) and m (the equivalent sample size) controls the extent of “smoothing”.
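A small sketch of the m-estimate above (counts are hypothetical; with p = 1/k over k values and m = k this reduces to Laplace add-one smoothing):

def m_estimate(count_xc, count_c, p, m):
    """P̂(x | c) = (N(X = x, C = c) + m·p) / (N(C = c) + m)."""
    return (count_xc + m * p) / (count_c + m)

# Example: 0 no-flu patients with muscle aches out of 20 no-flu patients,
# uniform prior p = 1/2 over {t, f}, equivalent sample size m = 2.
p_hat = m_estimate(count_xc=0, count_c=20, p=0.5, m=2)   # 1/22 — no longer zero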
Underflow Prevention
Multiplying lots of probabilities, which are between 0 and
1 by definition, can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to perform all
computations by summing logs of probabilities rather
than multiplying probabilities.
Class with highest final un-normalized log probability
score is still the most probable.
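A quick illustrative sketch of the underflow and of the log-space fix (the numbers are arbitrary):

import math

log_prior = math.log(0.5)
log_conds = [math.log(1e-4)] * 300            # 300 tiny word probabilities
direct = 0.5 * (1e-4 ** 300)                  # underflows to exactly 0.0 in double precision
log_score = log_prior + sum(log_conds)        # ≈ -2763.8, still usable for comparing classes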
P(positive | X) = ?
P(negative | X) = ?
[Figure: worked example — training instances labeled pos / neg, and a test instance X with attribute values (lg, red, circ) whose category ("??") is to be predicted.]
Two models:
Multivariate Bernoulli Model
Multinomial Model
Model 1: Multivariate Bernoulli
One feature Xw for each word in dictionary
Xw = true (1) in document d if w appears in d
Naive Bayes assumption:
Given the document’s topic, appearance of one word in
the document tells us nothing about chances that another
word appears
Parameter estimation
P̂(Xw = 1 | cj) = ?
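One natural way to complete this estimate (a sketch under the usual Bernoulli-model assumption, with unsmoothed counts and made-up toy documents) is the fraction of class-cj documents that contain w:

from collections import defaultdict

def bernoulli_estimates(docs, labels):
    """P̂(X_w = 1 | c) = fraction of class-c documents containing word w (unsmoothed)."""
    n_docs = defaultdict(int)                              # documents per class
    n_docs_with_w = defaultdict(lambda: defaultdict(int))  # per class: documents containing w
    for words, c in zip(docs, labels):
        n_docs[c] += 1
        for w in set(words):                               # presence/absence only
            n_docs_with_w[c][w] += 1
    return {c: {w: n / n_docs[c] for w, n in ws.items()} for c, ws in n_docs_with_w.items()}

docs = [["win", "lottery", "viagra"], ["exam", "score", "march"], ["win", "exam"]]
labels = ["spam", "legit", "legit"]
params = bernoulli_estimates(docs, labels)   # e.g. params["legit"]["exam"] == 1.0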
[Figure: the Multivariate Bernoulli model as a Bayesian network — labeled training documents (pos / neg) and a Category node generating binary word indicators w1 … w6.]
Multinomial Distribution
“The binomial distribution is the probability distribution of the number of
"successes" in n independent Bernoulli trials, with the same probability of
"success" on each trial. In a multinomial distribution, each trial results in exactly
one of some fixed finite number k of possible outcomes, with probabilities p1, ...,
pk (so that pi ≥ 0 for i = 1, ..., k and their sum is 1), and there are n independent
trials. Then let the random variables Xi indicate the number of times outcome
number i was observed over the n trials. X = (X1, …, Xk) follows a multinomial
distribution with parameters n and p, where p = (p1, ..., pk).” (Wikipedia)
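For reference (this formula is not on the slide): P(X1 = n1, …, Xk = nk) = n! / (n1! ⋯ nk!) · p1^n1 ⋯ pk^nk, for n1 + … + nk = n.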
Multinomial Naïve Bayes
Class-conditional unigram language model
Attributes are text positions, values are words.
One feature Xi for each word position in document
feature’s values are all words in dictionary
Value of Xi is the word in position i
Naïve Bayes assumption:
Given the document’s topic, word in one position in the
document tells us nothing about words in other positions
Additionally, assume the word distribution does not depend on position:
P(Xi = w | c) = P(Xj = w | c) for all positions i, j and all classes c ∈ C
Multinomial Naïve Bayes for Text
Modeled as generating a bag of words for a document
in a given category by repeatedly sampling with
replacement from a vocabulary V = {w1, w2,…wm}
based on the probabilities P(wj | ci).
Smooth probability estimates with Laplace
m-estimates assuming a uniform distribution over all
words (p = 1/|V|) and m = |V|
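A compact sketch of this estimator (with p = 1/|V| and m = |V| the m-estimate becomes add-one / Laplace smoothing); the toy documents and class names are invented:

from collections import Counter

def train_multinomial_nb(docs, labels):
    """Estimate P(c) and P̂(w | c) = (count of w in class c + 1) / (total words in c + |V|)."""
    vocab = {w for d in docs for w in d}
    prior = {c: labels.count(c) / len(labels) for c in set(labels)}
    cond = {}
    for c in prior:
        counts = Counter(w for d, l in zip(docs, labels) if l == c for w in d)
        total = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return prior, cond

docs = [["win", "lottery", "win"], ["exam", "score"], ["exam", "homework", "score"]]
labels = ["spam", "legit", "legit"]
prior, cond = train_multinomial_nb(docs, labels)   # e.g. cond["spam"]["win"] == 3/8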
Multinomial Naïve Bayes as a
Generative Model for Text
[Figure: training documents labeled spam / legit, and the two class-conditional bags of words they are generated from — spam: e.g. Viagra, win, hot, !, !!, Nigeria, deal, lottery, nude, $; legit: e.g. science, PM, computer, Friday, test, homework, March, score, May, exam.]
Naïve Bayes Inference Problem
[Figure: the same spam / legit generative model; given the words of a new document, infer which category's word distribution it was most likely generated from.]
Naïve Bayes Classification
Multinomial model:
P̂(Xi = w | cj) = fraction of times word w appears
across all documents of topic cj
Can create a mega-document for topic j by concatenating all
documents in this topic
Use frequency of w in mega-document
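A sketch of the mega-document trick (equivalent to counting over the individual documents; the toy documents are invented):

from collections import Counter

spam_docs = [["win", "lottery"], ["win", "viagra", "win"]]
mega_doc = [w for d in spam_docs for w in d]       # concatenate all documents of the topic
counts = Counter(mega_doc)
p_w_given_spam = {w: n / len(mega_doc) for w, n in counts.items()}   # unsmoothed frequencies
# e.g. p_w_given_spam["win"] == 3/5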
A Variant of the Multinomial Model
cNB = argmax_{cj ∈ C} P(cj) · ∏_{i ∈ positions} P(xi | cj)
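A sketch of this decision rule evaluated in log space (the parameters here are invented, standing in for estimates like those above):

import math

prior = {"spam": 0.4, "legit": 0.6}
cond = {"spam": {"win": 0.05, "exam": 0.001}, "legit": {"win": 0.002, "exam": 0.04}}

def classify(words):
    """c_NB = argmax_c [ log P(c) + Σ_i log P(x_i | c) ]."""
    return max(prior, key=lambda c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in words))

print(classify(["win", "win", "exam"]))   # -> "spam" with these made-up parameters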
Classification Results:
http://www.cs.utexas.edu/users/jp/research/email.paper.pdf
Naive Bayes is Not So Naive
Naïve Bayes took first and second place in the KDD-CUP 97 competition, among
16 (then) state-of-the-art algorithms
Goal: Financial services industry direct mail response prediction model: Predict if the
recipient of mail will actually respond to the advertisement – 750,000 records.
Robust to Irrelevant Features
Irrelevant features cancel each other out without affecting results
Very good in domains with many equally important features
A good dependable baseline for text classification (but not the best)!
Very Fast: learning requires only one counting pass over the data; testing is linear in the
number of attributes and the document collection size
Low Storage requirements