
Text Classification – Naive Bayes

June 19, 2013

Credits for slides: Allan, Arms, Manning, Lund, Noble, Page.


Generative and Discriminative Models: An Analogy

- The task is to determine the language that someone is speaking.
- Generative approach: learn each language and then determine which language the speech belongs to.
- Discriminative approach: determine the linguistic differences without learning any language – a much easier task!
Taxonomy of ML Models

- Generative Methods
  - Model class-conditional pdfs and prior probabilities
  - "Generative" since sampling can generate synthetic data points
  - Popular models:
    - Gaussians, Naïve Bayes, mixtures of multinomials
    - Mixtures of Gaussians, mixtures of experts, Hidden Markov Models (HMM)
    - Sigmoidal belief networks, Bayesian networks, Markov random fields
- Discriminative Methods
  - Directly estimate posterior probabilities
  - No attempt to model the underlying probability distributions
  - Focus computational resources on the given task – better performance
  - Popular models:
    - Logistic regression, SVMs (kernel methods)
    - Traditional neural networks, nearest neighbor
    - Conditional Random Fields (CRF)
Summary of Basic Probability Formulas

- Product rule: probability of a conjunction of two events A and B:

  P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)

- Sum rule: probability of a disjunction of two events A and B:

  P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

- Bayes theorem: the posterior probability of A given B:

  P(A | B) = P(B | A) P(A) / P(B)

- Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then

  P(B) = Σ_{i=1}^{n} P(B | Ai) P(Ai)
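For concreteness, a minimal Python sketch (using an illustrative toy joint distribution over two binary events, chosen here only for the example) that checks these identities numerically:

```python
# Toy joint distribution over two binary events A and B (illustrative numbers).
joint = {(True, True): 0.20, (True, False): 0.30,
         (False, True): 0.10, (False, False): 0.40}

def p_a(a):  # marginal P(A = a)
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b):  # marginal P(B = b)
    return sum(p for (_, bi), p in joint.items() if bi == b)

p_a_and_b = joint[(True, True)]                      # P(A ∧ B)
p_a_given_b = p_a_and_b / p_b(True)                  # P(A | B)
p_b_given_a = p_a_and_b / p_a(True)                  # P(B | A)

# Product rule: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
assert abs(p_a_and_b - p_a_given_b * p_b(True)) < 1e-12
assert abs(p_a_and_b - p_b_given_a * p_a(True)) < 1e-12

# Sum rule: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
p_a_or_b = 1.0 - joint[(False, False)]
assert abs(p_a_or_b - (p_a(True) + p_b(True) - p_a_and_b)) < 1e-12

# Bayes theorem: P(A | B) = P(B | A) P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a(True) / p_b(True)) < 1e-12
```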
Generative Probabilistic Models

- Assume a simple (usually unrealistic) probabilistic method by which the data was generated.
- For categorization, each category has a different parameterized generative model that characterizes that category.
- Training: use the data for each category to estimate the parameters of the generative model for that category.
- Testing: use Bayesian analysis to determine the category model that most likely generated a specific test instance.
Bayesian Methods

- Learning and classification methods based on probability theory.
- Bayes theorem plays a critical role in probabilistic learning and classification.
- Build a generative model that approximates how the data is produced.
- Use the prior probability of each category given no information about an item.
- Categorization produces a posterior probability distribution over the possible categories given a description of an item.
Bayes Theorem

  P(c | x) = P(x | c) P(c) / P(x)
Bayes Classifiers for Categorical Data

Task: classify a new instance x, described by a tuple of attribute values x = ⟨x1, x2, ..., xn⟩, into one of the classes cj ∈ C.

  c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, ..., xn)
        = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj) / P(x1, x2, ..., xn)
        = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)

  (The denominator P(x1, x2, ..., xn) does not depend on cj, so it can be dropped from the argmax.)

Example (Color and Shape are the attributes; red, blue, circle, square are their values):

  Ex  Color  Shape   Class
  1   red    circle  positive
  2   red    circle  positive
  3   red    square  negative
  4   blue   circle  negative
Joint Distribution

- The joint probability distribution for a set of random variables X1, ..., Xn gives the probability of every combination of values: P(X1, ..., Xn)

              positive            negative
              circle   square     circle   square
    red       0.20     0.02       0.05     0.30
    blue      0.02     0.01       0.20     0.20

  Example:

  Ex  Color  Shape   Class
  1   red    circle  positive
  2   red    circle  positive
  3   red    square  negative
  4   blue   circle  negative

Joint Distribution

- The joint probability distribution for a set of random variables X1, ..., Xn gives the probability of every combination of values: P(X1, ..., Xn)

              positive            negative
              circle   square     circle   square
    red       0.20     0.02       0.05     0.30
    blue      0.02     0.01       0.20     0.20

- The probability of all possible conjunctions can be calculated by summing the appropriate subset of values from the joint distribution.

  P(red ∧ circle) = ?
  P(red) = ?
Joint Distribution

- The joint probability distribution for a set of random variables X1, ..., Xn gives the probability of every combination of values: P(X1, ..., Xn)

              positive            negative
              circle   square     circle   square
    red       0.20     0.02       0.05     0.30
    blue      0.02     0.01       0.20     0.20

- The probability of all possible conjunctions can be calculated by summing the appropriate subset of values from the joint distribution.

  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
Joint Distribution

- The joint probability distribution for a set of random variables X1, ..., Xn gives the probability of every combination of values: P(X1, ..., Xn)

              positive            negative
              circle   square     circle   square
    red       0.20     0.02       0.05     0.30
    blue      0.02     0.01       0.20     0.20

- The probability of all possible conjunctions can be calculated by summing the appropriate subset of values from the joint distribution.

  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57

- Therefore, all conditional probabilities can also be calculated.
Joint Distribution

- The joint probability distribution for a set of random variables X1, ..., Xn gives the probability of every combination of values: P(X1, ..., Xn)

              positive            negative
              circle   square     circle   square
    red       0.20     0.02       0.05     0.30
    blue      0.02     0.01       0.20     0.20

- The probability of all possible conjunctions can be calculated by summing the appropriate subset of values from the joint distribution.

  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57

- Therefore, all conditional probabilities can also be calculated.

  P(positive | red ∧ circle) = ?
Joint Distribution

- The joint probability distribution for a set of random variables X1, ..., Xn gives the probability of every combination of values: P(X1, ..., Xn)

              positive            negative
              circle   square     circle   square
    red       0.20     0.02       0.05     0.30
    blue      0.02     0.01       0.20     0.20

- The probability of all possible conjunctions can be calculated by summing the appropriate subset of values from the joint distribution.

  P(red ∧ circle) = 0.20 + 0.05 = 0.25
  P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57

- Therefore, all conditional probabilities can also be calculated.

  P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
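For concreteness, a minimal Python sketch of these calculations; the dictionary layout is just one illustrative way to encode the joint table above:

```python
# Joint distribution P(Color, Shape, Class) from the table above.
joint = {
    ("red",  "circle", "positive"): 0.20, ("red",  "square", "positive"): 0.02,
    ("blue", "circle", "positive"): 0.02, ("blue", "square", "positive"): 0.01,
    ("red",  "circle", "negative"): 0.05, ("red",  "square", "negative"): 0.30,
    ("blue", "circle", "negative"): 0.20, ("blue", "square", "negative"): 0.20,
}

def prob(color=None, shape=None, cls=None):
    """Sum the joint entries consistent with the given (partial) assignment."""
    return sum(p for (c, s, k), p in joint.items()
               if (color is None or c == color)
               and (shape is None or s == shape)
               and (cls is None or k == cls))

print(prob(color="red", shape="circle"))          # P(red ∧ circle) = 0.25
print(prob(color="red"))                          # P(red) = 0.57
print(prob(color="red", shape="circle", cls="positive")
      / prob(color="red", shape="circle"))        # P(positive | red ∧ circle) = 0.80
```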
Bayes Classifiers

  c_MAP = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)
Bayes Classifiers

  c_MAP = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)

- P(cj)
  - Can be estimated from the frequency of classes in the training examples.
- P(x1, x2, ..., xn | cj)
  - O(|X|^n · |C|) parameters
  - Could only be estimated if a very, very large number of training examples was available.
- We need to make some sort of independence assumption about the features to make learning tractable.
The Naïve Bayes Classifier

[Figure: class node Flu with attribute nodes X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache as children.]

- Conditional independence assumption: attributes are independent of each other given the class:

  P(X1, ..., X5 | C) = P(X1 | C) · P(X2 | C) · ... · P(X5 | C)

- Multi-valued variables: multivariate model
- Binary variables: multivariate Bernoulli model
Learning the Model

[Figure: Bayes net with attribute nodes X1, ..., X6.]

- First attempt: maximum likelihood estimates
  - simply use the relative frequencies in the data:

  P̂(cj) = N(C = cj) / N

  P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
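A minimal sketch of these maximum-likelihood counts in Python; the toy dataset and names are illustrative, not taken from the slides:

```python
from collections import Counter, defaultdict

# Toy training data: (attribute dict, class label); values are illustrative.
data = [
    ({"Color": "red",  "Shape": "circle"}, "positive"),
    ({"Color": "red",  "Shape": "circle"}, "positive"),
    ({"Color": "red",  "Shape": "square"}, "negative"),
    ({"Color": "blue", "Shape": "circle"}, "negative"),
]

N = len(data)
class_counts = Counter(c for _, c in data)           # N(C = cj)
feat_counts = defaultdict(Counter)                   # N(Xi = xi, C = cj)
for x, c in data:
    for attr, value in x.items():
        feat_counts[(attr, c)][value] += 1

prior = {c: n / N for c, n in class_counts.items()}  # P̂(cj)

def likelihood(attr, value, c):                      # P̂(xi | cj), MLE
    return feat_counts[(attr, c)][value] / class_counts[c]

print(prior["positive"])                        # 0.5
print(likelihood("Color", "red", "positive"))   # 1.0
print(likelihood("Color", "blue", "positive"))  # 0.0  <- the zero-probability problem
```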
Problem with Max Likelihood

[Figure: class node Flu with attribute nodes X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache as children.]

  P(X1, ..., X5 | C) = P(X1 | C) · P(X2 | C) · ... · P(X5 | C)

- What if we have seen no training cases where the patient had muscle aches (X5 = t) but no flu (C = nf)?

  P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

- Zero probabilities cannot be conditioned away, no matter the other evidence!

  c = argmax_c P̂(c) ∏_i P̂(xi | c)
Smoothing to Improve Generalization on Test Data

- Add-one (Laplace) smoothing:

  P̂(xi | cj) = ( N(Xi = xi, C = cj) + 1 ) / ( N(C = cj) + k )

  where k is the number of values of Xi.

- Somewhat more subtle version (m-estimate):

  P̂(xi,k | cj) = ( N(Xi = xi,k, C = cj) + m · pi,k ) / ( N(C = cj) + m )

  where pi,k is the overall fraction of the data in which Xi = xi,k, and m controls the extent of smoothing.
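A hedged sketch of the add-one estimate in Python; here k is taken to be the number of attribute values observed in training, which is an assumption if unseen values exist:

```python
from collections import Counter, defaultdict

def train_smoothed(data):
    """Add-one (Laplace) estimates: (N(Xi=xi, C=cj) + 1) / (N(C=cj) + k)."""
    class_counts = Counter(c for _, c in data)
    feat_counts = defaultdict(Counter)
    values = defaultdict(set)
    for x, c in data:
        for attr, value in x.items():
            feat_counts[(attr, c)][value] += 1
            values[attr].add(value)

    def likelihood(attr, value, c):
        k = len(values[attr])                 # number of observed values of Xi
        return (feat_counts[(attr, c)][value] + 1) / (class_counts[c] + k)
    return likelihood

data = [({"Color": "red"}, "positive"), ({"Color": "red"}, "positive"),
        ({"Color": "red"}, "negative"), ({"Color": "blue"}, "negative")]
lik = train_smoothed(data)
print(lik("Color", "blue", "positive"))   # (0 + 1) / (2 + 2) = 0.25 instead of the MLE's 0.0
```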
Underflow Prevention

- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log probability score is still the most probable:

  c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
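A brief Python sketch of this log-space scoring; the probabilities are placeholders borrowed from the worked example later in the deck:

```python
import math

# Placeholder model parameters: log-priors and per-class attribute log-probabilities.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    ("pos", "medium"): math.log(0.1), ("neg", "medium"): math.log(0.2),
    ("pos", "red"):    math.log(0.9), ("neg", "red"):    math.log(0.3),
    ("pos", "circle"): math.log(0.9), ("neg", "circle"): math.log(0.3),
}

def classify(features):
    """Return argmax_c [ log P(c) + sum_i log P(x_i | c) ] plus the scores."""
    scores = {c: lp + sum(log_likelihood[(c, f)] for f in features)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get), scores

print(classify(["medium", "red", "circle"]))
# 'pos' wins: log(0.5*0.1*0.9*0.9) > log(0.5*0.2*0.3*0.3)
```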
Probability Estimation Example

Training examples:

  Ex  Size   Color  Shape     Class
  1   small  red    circle    positive
  2   large  red    circle    positive
  3   small  red    triangle  negative
  4   large  blue   circle    negative

Estimate the following probabilities from the data:

  Probability       positive  negative
  P(Y)
  P(small | Y)
  P(medium | Y)
  P(large | Y)
  P(red | Y)
  P(blue | Y)
  P(green | Y)
  P(square | Y)
  P(triangle | Y)
  P(circle | Y)
Probability Estimation Example

Training examples:

  Ex  Size   Color  Shape     Class
  1   small  red    circle    positive
  2   large  red    circle    positive
  3   small  red    triangle  negative
  4   large  blue   circle    negative

Maximum-likelihood estimates:

  Probability       positive  negative
  P(Y)              0.5       0.5
  P(small | Y)      0.5       0.5
  P(medium | Y)     0.0       0.0
  P(large | Y)      0.5       0.5
  P(red | Y)        1.0       0.5
  P(blue | Y)       0.0       0.5
  P(green | Y)      0.0       0.0
  P(square | Y)     0.0       0.0
  P(triangle | Y)   0.0       0.5
  P(circle | Y)     1.0       0.5
Naïve Bayes Example

  Probability       positive  negative
  P(Y)              0.5       0.5
  P(small | Y)      0.4       0.4
  P(medium | Y)     0.1       0.2
  P(large | Y)      0.5       0.4
  P(red | Y)        0.9       0.3
  P(blue | Y)       0.05      0.3
  P(green | Y)      0.05      0.4
  P(square | Y)     0.05      0.4
  P(triangle | Y)   0.05      0.3
  P(circle | Y)     0.9       0.3

Test instance: <medium, red, circle>

  c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Naïve Bayes Example

  Probability       positive  negative
  P(Y)              0.5       0.5
  P(medium | Y)     0.1       0.2
  P(red | Y)        0.9       0.3
  P(circle | Y)     0.9       0.3

Test instance: <medium, red, circle>

  P(positive | X) = ?
  P(negative | X) = ?

  c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Naïve Bayes Example

  Probability       positive  negative
  P(Y)              0.5       0.5
  P(medium | Y)     0.1       0.2
  P(red | Y)        0.9       0.3
  P(circle | Y)     0.9       0.3

Test instance: <medium, red, circle>

  c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)

  P(positive | X) = P(positive) · P(medium | positive) · P(red | positive) · P(circle | positive) / P(X)
                  = 0.5 · 0.1 · 0.9 · 0.9 / P(X) = 0.0405 / P(X) = 0.0405 / 0.0495 ≈ 0.818

  P(negative | X) = P(negative) · P(medium | negative) · P(red | negative) · P(circle | negative) / P(X)
                  = 0.5 · 0.2 · 0.3 · 0.3 / P(X) = 0.009 / P(X) = 0.009 / 0.0495 ≈ 0.182

  Since P(positive | X) + P(negative | X) = 0.0405 / P(X) + 0.009 / P(X) = 1,
  P(X) = 0.0405 + 0.009 = 0.0495.
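A quick Python check of this arithmetic, with the numbers copied from the table above:

```python
# Un-normalized scores for the test instance <medium, red, circle>.
p_pos = 0.5 * 0.1 * 0.9 * 0.9        # P(positive) * prod of P(xi | positive) = 0.0405
p_neg = 0.5 * 0.2 * 0.3 * 0.3        # P(negative) * prod of P(xi | negative) = 0.009

p_x = p_pos + p_neg                  # P(X) = 0.0495, so the posteriors sum to 1
print(p_pos / p_x, p_neg / p_x)      # ≈ 0.818 and ≈ 0.182 -> classify as positive
```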


Question

- How can we see the multivariate Naïve Bayes model as a generative model?
- A generative model produces the observed data by means of a probabilistic generation process:
  - First generate a class cj according to the prior probability P(C).
  - Then, for each attribute, generate xi according to P(xi | C = cj). (A sketch of this sampling process follows.)
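A minimal sketch of that sampling process; the parameters below are placeholders, not estimates from real data:

```python
import random

# Placeholder parameters: class prior and per-class attribute distributions.
prior = {"positive": 0.5, "negative": 0.5}
cond = {
    ("positive", "Color"): {"red": 0.9, "blue": 0.1},
    ("negative", "Color"): {"red": 0.3, "blue": 0.7},
    ("positive", "Shape"): {"circle": 0.9, "square": 0.1},
    ("negative", "Shape"): {"circle": 0.3, "square": 0.7},
}

def sample_instance():
    # 1. Generate a class cj according to P(C).
    c = random.choices(list(prior), weights=prior.values())[0]
    # 2. For each attribute, generate xi according to P(xi | C = cj).
    x = {attr: random.choices(list(dist), weights=dist.values())[0]
         for (cls, attr), dist in cond.items() if cls == c}
    return x, c

print(sample_instance())   # e.g. ({'Color': 'red', 'Shape': 'circle'}, 'positive')
```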
Naïve Bayes Generative Model

[Figure: a "Category" urn containing pos/neg labels; for each class (Positive, Negative), separate bags of Size, Color, and Shape values from which attribute values are drawn.]
Naïve Bayes Inference Problem

[Figure: the same generative model run in reverse: given an observed instance <lg, red, circ>, infer which category (Positive or Negative) most likely generated it.]
Naïve Bayes for Text Classification

Two models:
- Multivariate Bernoulli model
- Multinomial model
Model 1: Multivariate Bernoulli

- One feature Xw for each word in the dictionary
- Xw = true (1) in document d if w appears in d
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears
- Parameter estimation:

  P̂(Xw = 1 | cj) = ?
Model 1: Multivariate Bernoulli

- One feature Xw for each word in the dictionary
- Xw = true (1) in document d if w appears in d
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears
- Parameter estimation:

  P̂(Xw = 1 | cj) = fraction of documents of topic cj in which word w appears
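A minimal sketch of this estimate in Python; the toy document sets are illustrative, and the add-one smoothing shown is an assumption rather than something stated on this slide:

```python
from collections import Counter

# Toy corpus: (set of distinct words in the document, topic). Illustrative data.
docs = [({"win", "lottery", "viagra"},  "spam"),
        ({"win", "prize"},              "spam"),
        ({"meeting", "schedule"},       "legit"),
        ({"win", "project", "report"},  "legit")]

doc_count = Counter(c for _, c in docs)                      # documents per topic
word_doc_count = Counter((w, c) for words, c in docs for w in words)

def p_word_given_class(w, c, smooth=1):
    # Fraction of documents of topic c in which word w appears,
    # add-one smoothed over the two outcomes (appears / does not appear).
    return (word_doc_count[(w, c)] + smooth) / (doc_count[c] + 2 * smooth)

print(p_word_given_class("win", "spam"))    # (2 + 1) / (2 + 2) = 0.75
print(p_word_given_class("win", "legit"))   # (1 + 1) / (2 + 2) = 0.5
```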
Naïve Bayes Generative Model

[Figure: multivariate Bernoulli generative model: a "Category" urn of pos/neg labels; for each class (Positive, Negative), independent yes/no draws for each word w1, w2, w3 marking whether that word appears in the document.]
Model 2: Multinomial

[Figure: class node Cat with word nodes w1, ..., w6 as children.]
Multinomial Distribution
“The binomial distribution is the probability distribution of the number of
"successes" in n independent Bernoulli trials, with the same probability of
"success" on each trial. In a multinomial distribution, each trial results in exactly
one of some fixed finite number k of possible outcomes, with probabilities p1, ...,
pk (so that pi ≥ 0 for i = 1, ..., k and their sum is 1), and there are n independent
trials. Then let the random variables Xi indicate the number of times outcome
number i was observed over the n trials. X = (X1, ..., Xk) follows a multinomial
distribution with parameters n and p, where p = (p1, ..., pk).” (Wikipedia)
Multinomial Naïve Bayes

- Class-conditional unigram language model
- Attributes are text positions, values are words
  - One feature Xi for each word position in the document
  - The feature's values are all the words in the dictionary
  - The value of Xi is the word in position i
- Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions

  c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
       = argmax_{cj ∈ C} P(cj) · P(x1 = "our" | cj) · ... · P(xn = "text" | cj)

- Too many possibilities!
Multinomial Naive Bayes Classifiers

- Second assumption: classification is independent of the positions of the words (word appearance does not depend on position):

  P(Xi = w | c) = P(Xj = w | c)   for all positions i, j, every word w, and every class c

- Use the same parameters for each position
- The result is a bag-of-words model (over tokens)
- Just one multinomial feature predicting all words:

  c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(wi | cj)
Multinomial Naïve Bayes for Text

- Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2, ..., wm} based on the probabilities P(wj | ci).
- Smooth the probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|; a sketch of the resulting estimator follows.
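With p = 1/|V| and m = |V|, the m-estimate reduces to add-one smoothing over the vocabulary. A minimal sketch (the corpus and tokenization are illustrative):

```python
from collections import Counter

# Toy corpus: (list of tokens, category). Illustrative data and tokenization.
docs = [("win lottery win money".split(),    "spam"),
        ("viagra deal win".split(),          "spam"),
        ("meeting schedule project".split(), "legit"),
        ("project report due friday".split(), "legit")]

vocab = {w for tokens, _ in docs for w in tokens}
token_counts = {c: Counter() for _, c in docs}          # word counts per category
for tokens, c in docs:
    token_counts[c].update(tokens)

def p_word(w, c):
    # Laplace m-estimate with p = 1/|V|, m = |V|  =>  (count + 1) / (total + |V|)
    return (token_counts[c][w] + 1) / (sum(token_counts[c].values()) + len(vocab))

print(p_word("win", "spam"))     # (3 + 1) / (7 + 11) ≈ 0.22
print(p_word("win", "legit"))    # (0 + 1) / (7 + 11) ≈ 0.06
```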
Multinomial Naïve Bayes as a Generative Model for Text

[Figure: a "Category" urn of spam/legit labels; a spam bag of words (Viagra, hot, deal, Nigeria, lottery, nude, win, !, !!, $) and a legit bag of words (science, PM, computer, Friday, test, homework, March, May, score, exam) from which document tokens are drawn with replacement.]
Naïve Bayes Inference Problem

[Figure: inference in the same model: given the observed document "Viagra hot deal !!", infer whether the spam or the legit category most likely generated it.]
Naïve Bayes Classification

  c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
Parameter Estimation

- Multivariate Bernoulli model:

  P̂(Xw = 1 | cj) = fraction of documents of topic cj in which word w appears

- Multinomial model:

  P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj

  - Can create a mega-document for topic j by concatenating all documents on this topic
  - Use the frequency of w in the mega-document
A Variant of the Multinomial Model

- Represent each document d as an M-dimensional vector of counts ⟨tf(t1, d), ..., tf(tM, d)⟩, where tf(ti, d) is the term frequency of ti in d.

  c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)
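For reference, both text models are implemented in scikit-learn; a hedged sketch using count vectors (the toy corpus is illustrative and the package must be installed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

train_docs = ["win lottery now", "cheap viagra deal",
              "project meeting friday", "homework score posted"]
train_labels = ["spam", "spam", "legit", "legit"]

vectorizer = CountVectorizer()                  # documents -> term-frequency vectors
X = vectorizer.fit_transform(train_docs)

multinomial = MultinomialNB(alpha=1.0).fit(X, train_labels)   # Laplace smoothing
bernoulli = BernoulliNB(alpha=1.0).fit(X, train_labels)       # binarizes counts internally

test = vectorizer.transform(["cheap lottery deal friday"])
print(multinomial.predict(test), bernoulli.predict(test))
```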
Classification

- Multinomial vs. multivariate Bernoulli?
- The multinomial model is almost always more effective in text applications!
WebKB Experiment (1998)

- Classify webpages from CS departments into categories:
  - student, faculty, course, project, etc.
- Train on ~5,000 hand-labeled web pages
  - Cornell, Washington, U. Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results:

               Student  Faculty  Person  Project  Course  Department
  Extracted    180      66       246     99       28      1
  Correct      130      28       194     72       25      1
  Accuracy     72%      42%      79%     73%      89%     100%
NB Model Comparison: WebKB
Feature Selection: Why?

- Text collections have a large number of features
  - 10,000 to 1,000,000 unique words ... and more
- May make using a particular classifier feasible
  - Some classifiers can't deal with 100,000 features
- Reduces training time
  - Training time for some methods is quadratic or worse in the number of features
- Can improve generalization (performance)
  - Eliminates noise features
  - Avoids overfitting
Naïve Bayes - Spam Assassin

- Naïve Bayes has found a home in spam filtering
  - Paul Graham's "A Plan for Spam"
    - A mutant with more mutant offspring...
  - Widely used in spam filters
  - Classic Naive Bayes is superior when appropriately used, according to David D. Lewis
  - But spam filters also rely on many other things: black-hole lists, etc.
- Many email topic filters also use NB classifiers
Naïve Bayes on Spam Email

http://www.cs.utexas.edu/users/jp/research/email.paper.pdf
Naive Bayes is Not So Naive

- Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
  - Goal: a direct-mail response prediction model for the financial services industry: predict whether the recipient of a mailing will actually respond to the advertisement (750,000 records)
- Robust to irrelevant features: irrelevant features cancel each other out without affecting results
- Very good in domains with many equally important features
- A good, dependable baseline for text classification (but not the best)!
- Very fast: learning requires a single counting pass over the data; testing is linear in the number of attributes and the size of the document collection
- Low storage requirements