
Lecture 10 Naïve Bayes Classification

This document discusses Naive Bayes classification. It begins by providing examples of classification problems like spam detection, medical diagnosis, and weather prediction. It then introduces the Bayesian classification approach and derives the Naive Bayes classifier by making the naive assumption that features are conditionally independent given the class label. The document discusses training a Naive Bayes model, classifying new examples, and some of the limitations of the naive independence assumption. It concludes by noting that Naive Bayes is easy to implement and often effective in practice.



Lecture 10
Naïve Bayes Classification
Things We’d Like to Do

 Spam Classification
 Given an email, predict whether it is spam or not

 Medical Diagnosis
 Given a list of symptoms, predict whether a patient has cancer or not

 Weather
 Based on temperature, humidity, etc… predict if it will rain tomorrow
Bayesian Classification

 Problem statement:
 Given features X1,X2,…,Xn
 Predict a label Y
Another Application

 Digit Recognition

[Figure: a handwritten digit image fed into the classifier, which outputs "5"]

 X1,…,Xn ∈ {0,1} (Black vs. White pixels)

 Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier

 In class, we saw that a good strategy is to predict the most probable class given the observed features:

   argmax_y P(Y=y | X1,…,Xn)

 (for example: what is the probability that the image represents a 5 given its pixels?)

 So … how do we compute that?


The Bayes Classifier

 Use Bayes Rule!

   P(Y | X1,…,Xn) = P(X1,…,Xn | Y) · P(Y) / P(X1,…,Xn)

   where P(X1,…,Xn | Y) is the likelihood, P(Y) is the prior, and P(X1,…,Xn) is the normalization constant

 Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label
The Bayes Classifier

 Let’s expand this for our digit recognition task:

   P(Y=5 | X1,…,Xn) = P(X1,…,Xn | Y=5) · P(Y=5) / P(X1,…,Xn)
   P(Y=6 | X1,…,Xn) = P(X1,…,Xn | Y=6) · P(Y=6) / P(X1,…,Xn)

 To classify, we’ll simply compute these two probabilities and predict based on which one is greater
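
To make the comparison concrete, here is a minimal Python sketch of that decision rule; the prior and likelihood numbers are invented for illustration, since the slides do not give any:

```python
# Minimal sketch of the Bayes classifier decision for the 5-vs-6 task.
# The prior and likelihood values below are made up for illustration only.

prior = {5: 0.5, 6: 0.5}             # P(Y=5), P(Y=6)
likelihood = {5: 1e-12, 6: 4e-12}    # P(X1,...,Xn | Y) for the observed pixels

# Unnormalized posteriors: P(X1,...,Xn | Y) * P(Y).
score = {y: likelihood[y] * prior[y] for y in (5, 6)}

# The normalization constant P(X1,...,Xn) is the same for both classes,
# so comparing the unnormalized scores gives the same prediction.
prediction = max(score, key=score.get)
posterior_5 = score[5] / (score[5] + score[6])

print(prediction)    # 6
print(posterior_5)   # 0.2
```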
Model Parameters

 For the Bayes classifier, we need to “learn” two functions, the likelihood and the prior

 How many parameters are required to specify the prior for our digit recognition example?
Model Parameters

 How many parameters are required to specify the likelihood?
 (Supposing that each image is 30x30 pixels)
Model Parameters

 The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way too many parameters:
 We’ll run out of space
 We’ll run out of time
 And we’ll need tons of training data (which is usually not available)


The Naïve Bayes Model

 The Naïve Bayes Assumption: Assume that all features are independent given the class label Y
 Equationally speaking:

   P(X1,…,Xn | Y) = P(X1 | Y) × P(X2 | Y) × … × P(Xn | Y)

 (We will discuss the validity of this assumption later)


Why is this useful?

 # of parameters for modeling P(X1,…,Xn|Y):
 2(2^n − 1)   (a full joint over n binary features needs 2^n − 1 free parameters, for each of the 2 classes)

 # of parameters for modeling P(X1|Y),…,P(Xn|Y):
 2n   (one parameter per feature per class)
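
A quick sanity check of these counts for the 30x30 digit example (n = 900 binary pixels, two classes); this snippet is just arithmetic and is not part of the original slides:

```python
n = 30 * 30  # number of binary pixel features in a 30x30 image

# Full joint likelihood P(X1,...,Xn | Y): 2^n - 1 free parameters per class, 2 classes.
full_joint = 2 * (2**n - 1)

# Naive Bayes: one parameter P(Xi=1 | Y=y) per feature per class.
naive_bayes = 2 * n

print(len(str(full_joint)))  # 272 -- the full joint needs a 272-digit number of parameters
print(naive_bayes)           # 1800
```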
Naïve Bayes Training

 Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data:

 [Figure: sample images from the MNIST training set]
Naïve Bayes Training

 Training in Naïve Bayes is easy:
 Estimate P(Y=v) as the fraction of records with Y=v
 Estimate P(Xi=u|Y=v) as the fraction of records with Y=v for which Xi=u

 (This corresponds to Maximum Likelihood estimation of model parameters)
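
A minimal training sketch in Python/NumPy, assuming the binary features sit in an array `X` of shape (num_records, n) and the labels in `y`; the function and variable names here are hypothetical, not from the slides:

```python
import numpy as np

def train_naive_bayes(X, y):
    """Maximum likelihood estimates for Naive Bayes with binary features.

    X: (num_records, n) array of 0/1 features
    y: (num_records,) array of class labels
    Returns P(Y=v) and, per class, the vector of P(Xi=1 | Y=v).
    """
    priors, feature_probs = {}, {}
    for v in np.unique(y):
        rows = X[y == v]
        priors[v] = len(rows) / len(X)        # fraction of records with Y=v
        feature_probs[v] = rows.mean(axis=0)  # fraction of those records with Xi=1
    return priors, feature_probs
```

For the binary digit task, `rows.mean(axis=0)` is exactly the "averaging of all the training fives (or sixes)" mentioned a couple of slides below.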
Naïve Bayes Training

 In practice, some of these counts can be zero
 Fix this by adding “virtual” counts:

 (This is like putting a prior on parameters and doing MAP estimation instead of MLE)
 This is called Smoothing
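
A hedged sketch of what "virtual counts" can look like for the binary-feature estimate above (standard add-k / Laplace smoothing; the constant k is a free choice, commonly k = 1, and nothing here is taken verbatim from the slides):

```python
import numpy as np

def feature_probs_smoothed(rows, k=1.0):
    """Estimate P(Xi=1 | Y=v) from the records with label v, adding k 'virtual'
    counts to each of the two possible feature values so no estimate is exactly zero."""
    count_ones = rows.sum(axis=0)              # records with Xi=1 among those with Y=v
    total = rows.shape[0]                      # all records with Y=v
    return (count_ones + k) / (total + 2 * k)  # +k for Xi=1 and +k for Xi=0
```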
Naïve Bayes Training

 For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.
Naïve Bayes Classification
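
This slide shows the classification step pictorially; as a rough sketch (reusing the hypothetical `priors` and `feature_probs` from the training snippet above), classifying a new binary image x means picking the class with the larger P(Y=v) · P(X1=x1|Y=v) · … · P(Xn=xn|Y=v):

```python
import numpy as np

def classify(x, priors, feature_probs):
    """Return the class v maximizing P(Y=v) * prod_i P(Xi=xi | Y=v)."""
    best_class, best_score = None, -1.0
    for v, prior in priors.items():
        p = feature_probs[v]                      # P(Xi=1 | Y=v) for each pixel i
        per_pixel = np.where(x == 1, p, 1.0 - p)  # P(Xi=xi | Y=v)
        score = prior * per_pixel.prod()          # can underflow for many pixels --
        if score > best_score:                    # see the Numerical Stability slides below
            best_class, best_score = v, score
    return best_class
```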
Outputting Probabilities

 What’s nice about Naïve Bayes (and generative models in general) is that it returns probabilities
 These probabilities can tell us how confident the algorithm is
 So… don’t throw away those probabilities!
Performance on a Test Set

 Naïve Bayes is often a good choice if you don’t have much training data!
Naïve Bayes Assumption

 Recall the Naïve Bayes assumption:

 that all features are independent given the class label Y

 Does this hold for the digit recognition problem?


Exclusive-OR Example
 For an example where conditional independence fails:
 Y = XOR(X1, X2)

 X1   X2   P(Y=0|X1,X2)   P(Y=1|X1,X2)
 0    0    1              0
 0    1    0              1
 1    0    0              1
 1    1    1              0
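
To see the failure concretely, here is a small check (mine, not from the slides): train Naive Bayes on the four equally likely XOR rows and every estimated probability comes out 0.5, so the classifier is no better than a coin flip even though Y is a deterministic function of (X1, X2).

```python
import numpy as np

# The four XOR rows, each equally likely: columns are X1, X2, and Y = XOR(X1, X2).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

for v in (0, 1):
    rows = X[y == v]
    print(v, len(rows) / len(X), rows.mean(axis=0))
# Prints 0.5 for every estimate, for both classes:
#   0 0.5 [0.5 0.5]
#   1 0.5 [0.5 0.5]
# With all P(Xi|Y) equal to 0.5, Naive Bayes outputs P(Y=0|X1,X2) = P(Y=1|X1,X2) = 0.5
# for every input, so it cannot represent XOR.
```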
 Actually, the Naïve Bayes assumption is almost never true

 Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold
Numerical Stability

 It is often the case that machine learning algorithms need to work with very small numbers
 Imagine computing the probability of 2000 independent coin flips

 MATLAB thinks that (.5)^2000 = 0
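
The same underflow is easy to reproduce in Python (a quick check, not tied to MATLAB):

```python
import math

p = 0.5 ** 2000
print(p)                       # 0.0 -- underflows in double precision

log_p = 2000 * math.log(0.5)
print(log_p)                   # about -1386.3 -- perfectly representable as a log
```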
Numerical Stability

 Instead of comparing P(Y=5|X1,…,Xn) with P(Y=6|X1,…,Xn),
 Compare their logarithms
Recovering the Probabilities

 Suppose that for some constant K, we have:
   log P(Y=5|X1,…,Xn) + K

 And
   log P(Y=6|X1,…,Xn) + K

 How would we recover the original probabilities?


Recovering the Probabilities

 Given: log P(Y=v|X1,…,Xn) + K for each class v
 Then for any constant C, exponentiating the shifted scores and normalizing gives back the same probabilities:

   P(Y=v|X1,…,Xn) = exp(log P(Y=v|X1,…,Xn) + K + C) / Σ_v′ exp(log P(Y=v′|X1,…,Xn) + K + C)

 One suggestion: set C such that the greatest of these log scores is shifted to zero:

   C = −max_v [ log P(Y=v|X1,…,Xn) + K ]
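
A sketch of that recovery in Python, under the reading above: shift every log score so the largest becomes zero, exponentiate, and renormalize (the function name and example numbers are mine):

```python
import math

def recover_probabilities(log_scores):
    """Turn log-posterior-plus-unknown-constant values into normalized probabilities.

    Subtracting max(scores) from every score leaves the ratios unchanged but keeps
    the largest exponent at exp(0) = 1, so nothing underflows.
    """
    shift = max(log_scores.values())
    exp_scores = {v: math.exp(s - shift) for v, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {v: e / total for v, e in exp_scores.items()}

# Example: log P(Y=5|X) + K = -1005, log P(Y=6|X) + K = -1000
print(recover_probabilities({5: -1005.0, 6: -1000.0}))
# {5: 0.0066..., 6: 0.9933...}
```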
Recap

 We defined a Bayes classifier but saw that it’s intractable to compute P(X1,…,Xn|Y)
 We then used the Naïve Bayes assumption – that everything is independent given the class label Y

 A natural question: is there some happy compromise where we only assume that some features are conditionally independent?
 Stay Tuned…
Conclusions

 Naïve Bayes is:
 Really easy to implement and often works well
 Often a good first thing to try
 Commonly used as a “punching bag” for smarter algorithms
