Naive Bayes

The decision surface of a Naïve Bayes classifier with boolean features (or Gaussian features whose variances do not depend on the class) is a set of linear decision boundaries. Specifically: for a two-class problem with two such features the boundary between the classes is a straight line; with more than two features it is a hyperplane; and with more than two classes the decision surface is piecewise linear, with one hyperplane separating each pair of classes. This is because, under the conditional independence assumption of Naïve Bayes, the log posterior log P(Y|X) decomposes into the log prior log P(Y) plus a sum of per-feature terms log P(X1|Y), log P(X2|Y), etc. Each term is linear in its feature, so the boundary between any two classes remains linear no matter how many features there are.
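A minimal sketch of the argument for boolean features (the symbols θi1, θi0, wi and b below are introduced here only for illustration; they are not from the original text): write the log odds of the two classes and observe that every feature contributes a term linear in xi.

\log\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)}
  = \log\frac{P(Y=1)}{P(Y=0)} + \sum_{i=1}^{n} \log\frac{P(x_i \mid Y=1)}{P(x_i \mid Y=0)}
  = b + \sum_{i=1}^{n} w_i x_i ,

where, writing \theta_{i1} = P(X_i = 1 \mid Y=1) and \theta_{i0} = P(X_i = 1 \mid Y=0),

w_i = \log\frac{\theta_{i1}(1-\theta_{i0})}{\theta_{i0}(1-\theta_{i1})} ,
\qquad
b = \log\frac{P(Y=1)}{P(Y=0)} + \sum_{i=1}^{n} \log\frac{1-\theta_{i1}}{1-\theta_{i0}} .

Setting the log odds to zero gives the decision boundary b + Σi wi xi = 0, a hyperplane.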


Bayesian Classifiers,

Conditional Independence
and Naïve Bayes
Required reading:
•  Mitchell draft chapter, sections 1 and 2.
(available on class website)

Machine Learning 10-601

Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University

January 28 and February 2, 2009

Let’s learn classifiers by learning P(Y|X)
Suppose Y=wealth, X=<gender, hours_worked>
How many parameters must we estimate?
Suppose X =<X1,… Xn>
where Xi and Y are boolean RV’s

To estimate P(Y| X) = P(Y| X1, X2, … Xn) directly, we need one parameter P(Y=1 | x) for each of the 2^n possible joint values x of <X1, … Xn>, i.e. 2^n parameters when the Xi are boolean.

Can we reduce params by using Bayes Rule?
Suppose X =<X1,… Xn>
where Xi and Y are boolean RV’s
Bayes Rule

P(Y | X) = P(X | Y) P(Y) / P(X)

Which is shorthand for:

for all values yi of Y and xj of X:
P(Y = yi | X = xj) = P(X = xj | Y = yi) P(Y = yi) / P(X = xj)

Equivalently:

P(Y = yi | X = xj) = P(X = xj | Y = yi) P(Y = yi) / Σk P(X = xj | Y = yk) P(Y = yk)
Naïve Bayes
Naïve Bayes assumes

P(X1, …, Xn | Y) = Πi P(Xi | Y)

i.e., that Xi and Xj are conditionally independent given Y, for all i ≠ j
Conditional Independence
Definition: X is conditionally independent of Y given Z, if
the probability distribution governing X is independent
of the value of Y, given the value of Z

Which we often write

P(X | Y, Z) = P(X | Z)

E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Naïve Bayes uses assumption that the Xi are conditionally
independent, given Y

Given this assumption, then:

P(X1, X2 | Y) = P(X1 | X2, Y) P(X2 | Y) = P(X1 | Y) P(X2 | Y)

in general:

P(X1, …, Xn | Y) = Πi P(Xi | Y)

How many parameters needed to describe P(X|Y)? P(Y)?


•  Without conditional indep assumption?
•  With conditional indep assumption?
How many parameters to estimate?
P(X1, ... Xn | Y), all variables boolean
Without conditional independence assumption: 2(2^n − 1) parameters for P(X1, … Xn | Y), plus 1 for P(Y)

With conditional independence assumption: 2n parameters for P(X1, … Xn | Y), plus 1 for P(Y)

E.g., with n = 30 boolean features this is the difference between roughly 2 billion parameters and just 60.


Naïve Bayes in a Nutshell
Bayes rule:

P(Y = yk | X1, …, Xn) = P(Y = yk) P(X1, …, Xn | Y = yk) / Σj P(Y = yj) P(X1, …, Xn | Y = yj)

Assuming conditional independence among the Xi's:

P(Y = yk | X1, …, Xn) = P(Y = yk) Πi P(Xi | Y = yk) / Σj P(Y = yj) Πi P(Xi | Y = yj)

So, the classification rule for Xnew = < X1, …, Xn > is:

Ynew ← argmax over yk of P(Y = yk) Πi P(Xi_new | Y = yk)

(the denominator is the same for every yk, so it can be dropped when taking the argmax)

Naïve Bayes Algorithm – discrete Xi

•  Train Naïve Bayes (examples)
   for each* value yk
      estimate πk ≡ P(Y = yk)
   for each* value xij of each attribute Xi
      estimate θijk ≡ P(Xi = xij | Y = yk)

•  Classify (Xnew)
   Ynew ← argmax over yk of πk Πi P(Xi_new | Y = yk), i.e. argmax over yk of πk Πi θijk,
   where j picks out the value of Xi observed in Xnew (a code sketch follows below)

* probabilities must sum to 1, so for a variable with K possible values only K − 1 parameters need be estimated
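The following minimal Python sketch makes the train/classify loop above concrete for discrete features. It is an illustration only: the function names, the dictionary layout prior[y] and likelihood[i][y][v], and the toy data are assumptions of this sketch, not part of the slides. Training uses the MLE relative frequencies described on the next slide; classification applies the argmax rule in log space to avoid underflow.

import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (x, y) pairs, x a tuple of discrete feature values.
    Returns prior[y] = P(Y=y) and likelihood[i][y][v] = P(Xi=v | Y=y),
    both estimated as MLE relative frequencies (no smoothing)."""
    n_features = len(examples[0][0])
    y_counts = Counter(y for _, y in examples)
    prior = {y: c / len(examples) for y, c in y_counts.items()}
    likelihood = {i: {y: defaultdict(float) for y in y_counts} for i in range(n_features)}
    for x, y in examples:
        for i, v in enumerate(x):
            likelihood[i][y][v] += 1.0
    for i in range(n_features):
        for y in y_counts:
            for v in likelihood[i][y]:
                likelihood[i][y][v] /= y_counts[y]
    return prior, likelihood

def classify(x_new, prior, likelihood):
    """Return argmax_y P(Y=y) * prod_i P(Xi = x_new[i] | Y=y), computed in log space."""
    best_y, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, v in enumerate(x_new):
            p = likelihood[i][y][v]
            if p == 0.0:                 # an unseen value zeroes out the class (see Subtlety #1)
                score = float("-inf")
                break
            score += math.log(p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# Hypothetical toy data for the Squirrel Hill example: x = (G, D, M), y = S
data = [((1, 0, 1), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((0, 0, 1), 0)]
prior, likelihood = train(data)
print(classify((1, 0, 1), prior, likelihood))   # -> 1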


Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates (MLEs):

πk ≡ P(Y = yk):   πk = #D{Y = yk} / |D|

θijk ≡ P(Xi = xij | Y = yk):   θijk = #D{Xi = xij AND Y = yk} / #D{Y = yk}

where #D{condition} is the number of items in the training data D for which the condition holds (e.g., #D{Y = yk} is the number of items in D for which Y = yk)
Example: Live in Sq Hill? P(S|G,D,M)
•  S=1 iff live in Squirrel Hill
•  G=1 iff shop at SH Giant Eagle
•  D=1 iff Drive to CMU
•  M=1 iff Rachel Maddow fan
Naïve Bayes: Subtlety #1
If unlucky, our MLE estimate for P(Xi | Y) might be
zero. (e.g., X373= Birthday_Is_January_30_1990)

•  Why worry about just one parameter out of many? (A single zero estimate drives the whole product Πi P(Xi | Y = yk) to zero, erasing the evidence from every other feature for that class.)

•  What can be done to avoid this?


Estimating Parameters: Y, Xi discrete-valued

Maximum likelihood estimates:

πk = #D{Y = yk} / |D|
θijk = #D{Xi = xij AND Y = yk} / #D{Y = yk}

MAP estimates (Dirichlet priors), with l ≥ 0 the prior strength, J the number of values Xi can take, and K the number of classes:

πk = (#D{Y = yk} + l) / (|D| + lK)
θijk = (#D{Xi = xij AND Y = yk} + l) / (#D{Y = yk} + lJ)

Only difference:
“imaginary” examples: l hallucinated observations of each value in each count (l = 1 gives Laplace smoothing)
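A minimal sketch of the smoothed estimate in code, reusing the illustrative conventions of the train() sketch above; the default prior strength l = 1 (Laplace smoothing) is an assumption of this sketch, not a value fixed by the slides.

def smoothed_likelihood(count_xi_and_y, count_y, n_values, l=1.0):
    """MAP / Dirichlet-smoothed estimate of P(Xi = xij | Y = yk):
    add l imaginary examples of each of the n_values settings of Xi."""
    return (count_xi_and_y + l) / (count_y + l * n_values)

# e.g. a value never seen with class yk still gets nonzero probability:
print(smoothed_likelihood(0, 50, n_values=2))   # 1/52 instead of 0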
Naïve Bayes: Subtlety #2
Often the Xi are not really conditionally independent

•  We use Naïve Bayes in many cases anyway, and it often works pretty well
–  often the right classification, even when not the right
probability (see [Domingos&Pazzani, 1996])

•  What is effect on estimated P(Y|X)?


–  Special case: what if we add two copies: Xi = Xk (illustrated below)
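A small illustration of that special case, with made-up numbers: duplicating a feature counts its evidence twice, so the estimated posterior becomes overconfident even when the predicted class does not change.

def posterior(prior1, likes1, likes0):
    """P(Y=1 | x) for two-class Naive Bayes, given per-feature likelihoods for each class."""
    p1, p0 = prior1, 1 - prior1
    for l1, l0 in zip(likes1, likes0):
        p1 *= l1
        p0 *= l0
    return p1 / (p1 + p0)

# One informative feature observed, with P(x|Y=1)=0.8 and P(x|Y=0)=0.3 ...
print(posterior(0.5, [0.8], [0.3]))            # ~0.73
# ... versus the same feature accidentally included twice (Xi = Xk):
print(posterior(0.5, [0.8, 0.8], [0.3, 0.3]))  # ~0.88, overconfident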
Learning to classify text documents
•  Classify which emails are spam?
•  Classify which emails promise an attachment?
•  Classify which web pages are student home
pages?

How shall we represent text documents for Naïve Bayes?
Baseline: Bag of Words Approach
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1

Zaire 0
For code and data, see
www.cs.cmu.edu/~tom/mlbook.html
click on “Software and Data”
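A minimal sketch of the bag-of-words representation, assuming a fixed vocabulary list; the vocabulary and the document below are illustrative, not taken from the slide.

from collections import Counter

def bag_of_words(text, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["aardvark", "about", "all", "africa", "apple", "gas", "oil", "zaire"]
doc = "All about the oil and gas dispute: all of Africa is watching"
print(bag_of_words(doc, vocabulary))   # [0, 1, 2, 1, 0, 1, 1, 0]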
What if we have continuous Xi ?
E.g., image classification: Xi is the ith pixel

Gaussian Naïve Bayes (GNB): assume

P(Xi = x | Y = yk) = N(x; μik, σik), i.e. a Gaussian with mean μik and standard deviation σik for each feature i and class k

Sometimes we additionally assume the variance
•  is independent of Y (i.e., σi),
•  or independent of Xi (i.e., σk),
•  or both (i.e., σ)
Gaussian (aka Normal) Distribution

p(x) = (1 / (σ √(2π))) exp( −(x − μ)² / (2σ²) ), with mean μ and variance σ²

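For reference, a direct transcription of this density into Python (a small helper used only for illustration):

import math

def gaussian_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_pdf(0.0, 0.0, 1.0))   # ~0.3989, the peak of the standard normal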
Gaussian Naïve Bayes Algorithm – continuous Xi
(but still discrete Y)

•  Train Naïve Bayes (examples)
   for each value yk
      estimate* πk ≡ P(Y = yk)
   for each attribute Xi
      estimate the class conditional mean μik and variance σik²

•  Classify (Xnew)
   Ynew ← argmax over yk of πk Πi N(Xi_new; μik, σik)
   (a code sketch follows below)

* probabilities must sum to 1, so only (number of classes − 1) of the πk need be estimated
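A minimal Gaussian Naïve Bayes sketch in NumPy under the assumptions above (a separate mean and variance per feature and class); the function and variable names are illustrative, not from the slides.

import numpy as np

def train_gnb(X, y):
    """X: (m, n) array of continuous features, y: (m,) array of class labels.
    Returns class priors, per-class feature means, and per-class feature variances."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    variances = {c: X[y == c].var(axis=0) for c in classes}
    return priors, means, variances

def classify_gnb(x_new, priors, means, variances):
    """argmax_yk of log P(Y=yk) + sum_i log N(x_i; mu_ik, sigma_ik^2)."""
    def log_score(c):
        var = variances[c]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x_new - means[c]) ** 2 / var)
        return np.log(priors[c]) + log_lik
    return max(priors, key=log_score)

# Tiny synthetic example with two classes and two features:
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 4.1], [2.9, 3.9]])
y = np.array([0, 0, 1, 1])
priors, means, variances = train_gnb(X, y)
print(classify_gnb(np.array([1.1, 2.1]), priors, means, variances))   # expect 0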


Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates (with j indexing training examples, i features, and k classes):

μik = ( Σj Xij δ(Yj = yk) ) / ( Σj δ(Yj = yk) )

σik² = ( Σj (Xij − μik)² δ(Yj = yk) ) / ( Σj δ(Yj = yk) )

where Xij is the ith feature of the jth training example, and δ(z) = 1 if z is true, else 0
GNB Example: Classify a person’s
cognitive activity, based on brain image

•  are they reading a sentence or viewing a picture?

•  reading the word “Hammer” or “Apartment”?

•  viewing a vertical or horizontal line?

•  answering the question, or getting confused?


Stimuli for our study: [image of an example stimulus, “ant”]; 60 distinct exemplars, presented 6 times each


[Figure: three brain maps with an fMRI activation color scale from below average to high.
 Panel 1: fMRI voxel means for “bottle”: the means defining P(Xi | Y = “bottle”).
 Panel 2: mean fMRI activation over all stimuli.
 Panel 3: “bottle” minus mean activation.]
Scaling up: 60 exemplars

Categories            Exemplars
BODY PARTS            leg arm eye foot hand
FURNITURE             chair table bed desk dresser
VEHICLES              car airplane train truck bicycle
ANIMALS               horse dog bear cow cat
KITCHEN UTENSILS      glass knife bottle cup spoon
TOOLS                 chisel hammer screwdriver pliers saw
BUILDINGS             apartment barn house church igloo
PART OF A BUILDING    window door chimney closet arch
CLOTHING              coat dress shirt skirt pants
INSECTS               fly ant bee butterfly beetle
VEGETABLES            lettuce tomato carrot corn celery
MAN MADE OBJECTS      refrigerator key telephone watch bell
Rank Accuracy Distinguishing among 60 words
Where in the brain is activity that distinguishes tools vs. buildings?

[Figure: accuracy of a searchlight classifier, trained on the cluster of voxels within radius 1 of each voxel, shown at every voxel. Accuracies of these cubical 27-voxel classifiers centered at each significant voxel fall in the 0.7-0.8 range.]
What you should know:
•  Training and using classifiers based on Bayes rule

•  Conditional independence
–  What it is
–  Why it’s important

•  Naïve Bayes
–  What it is
–  Why we use it so much
–  Training using MLE, MAP estimates
–  Discrete variables (Bernoulli) and continuous (Gaussian)
Questions:
•  Can you use Naïve Bayes for a combination of
discrete and real-valued Xi?

•  How can we easily model just 2 of n attributes as dependent?

•  What does the decision surface of a Naïve Bayes classifier look like?
What is the form of the decision surface for a Naïve Bayes classifier?
