Bayesian Classification

The document discusses the application of the Naïve Bayes classifier for digit recognition, specifically predicting whether an image represents the digit 5 or 6 based on pixel values. It explains the use of Bayes' Rule to compute probabilities and the necessity of learning likelihood and prior functions, while addressing challenges such as the independence assumption of features. Additionally, it compares Naïve Bayes with decision trees, highlighting differences in model characteristics and training requirements.


Another Application

• Digit Recognition (5 or 6)

[Figure: an input image is fed to the classifier, which outputs 5]
• X1,…,Xn ∈ {0,1} (Black vs. White pixels)
• Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier

• A good strategy is to predict the class with the highest posterior probability given the observed pixels:

  predict the y ∈ {5, 6} that maximizes P(Y = y | X1,…,Xn), where X is the collection of pixel values

• (for example: what is the probability that the image represents a 5 given its pixels?)

• So … How do we compute that?


The Bayes Classifier

• Use Bayes' Rule!

  P(Y | X1,…,Xn) = P(X1,…,Xn | Y) · P(Y) / P(X1,…,Xn)

  Here P(X1,…,Xn | Y) is the likelihood, P(Y) is the prior, and the denominator is a
  normalization constant equal to the total probability of the feature set

• Why did this help? Well, we think that we might be able to specify how features are
  "generated" by the class label

[Figure: class labels Y1, Y2, Y3 each generating points in the feature space X]
The Bayes Classifier

• Let's expand this for our digit recognition task:

  P(Y = 5 | X1,…,Xn) = P(X1,…,Xn | Y = 5) · P(Y = 5) / P(X1,…,Xn)
  P(Y = 6 | X1,…,Xn) = P(X1,…,Xn | Y = 6) · P(Y = 6) / P(X1,…,Xn)

• To classify, we'll simply compute these two probabilities and predict based on which one is greater
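As a minimal sketch (not from the slides; likelihood and prior stand for functions we still have to learn), the decision rule is just a comparison of the two unnormalized posteriors:

```python
# Minimal sketch: pick the digit whose unnormalized posterior
# P(X = x | Y = y) * P(Y = y) is larger; the shared denominator P(X = x)
# does not affect which class wins.
def classify(x, likelihood, prior, classes=(5, 6)):
    return max(classes, key=lambda y: likelihood(x, y) * prior(y))
```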
Model Parameters
• For the Bayes classifier, we need to “learn” two functions, the likelihood and the
prior

• How many parameters are required to specify the likelihood for our digit
recognition example?
Model Parameters
• How many parameters are required to specify the likelihood?
• (Supposing that each image is 30x30 pixels)

?
Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way
too many parameters:
• We’ll run out of space
• We’ll run out of time
• And we’ll need tons of training data (which is usually not available)
The Naïve Bayes Model

• The Naïve Bayes Assumption: Assume that all features are independent given the class label Y

• Equationally speaking:

  P(X1,…,Xn | Y) = P(X1 | Y) · P(X2 | Y) · … · P(Xn | Y)

• (We will discuss the validity of this assumption later)
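A minimal sketch of the factored likelihood, assuming binary features and a hypothetical table per_feature[y][i] = P(Xi = 1 | Y = y) (illustrative names, not from the slides):

```python
def nb_likelihood(x, y, per_feature):
    """P(X1,…,Xn | Y = y) under the Naïve Bayes assumption.

    x:            sequence of 0/1 feature values
    per_feature:  per_feature[y][i] = P(Xi = 1 | Y = y)
    """
    prob = 1.0
    for i, xi in enumerate(x):
        p1 = per_feature[y][i]
        prob *= p1 if xi == 1 else 1.0 - p1   # product of per-feature terms
    return prob
```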
Why is this useful?

• # of likelihoods for modeling P(X1,…,Xn | Y) with K classes and n binary features:

  ▪ K · 2^n = 2 · 2^900 likelihoods

• # of parameters for modeling P(X1 | Y),…,P(Xn | Y):

  ▪ K · n = 2 · 900 = 1800 likelihoods
  ▪ K priors
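For concreteness: with K = 2 classes and n = 900 binary pixels, the full joint model needs on the order of 2^900 ≈ 10^271 likelihood values, whereas the factored model needs only 2 × 900 = 1800 per-pixel likelihoods plus 2 priors.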
Naïve Bayes Training
• Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data.
Assume BW images:

[Figure: sample images from the MNIST training data]

Naïve Bayes Training
• Training in Naïve Bayes is easy:
• FOR PRIORS: Estimate P(Y=v) as the fraction of records with Y=v

• FOR LIKELIHOOD-FACTORS: Estimate P(Xi=u|Y=v) as the fraction of records with Y=v for which Xi=u
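A minimal training sketch following these two rules, assuming flattened 0/1 pixel arrays and NumPy (function and variable names here are illustrative, not from the slides):

```python
import numpy as np

def train_naive_bayes(X, y, classes=(5, 6)):
    """X: (num_images, num_pixels) array of 0/1 pixels; y: array of class labels."""
    priors, likelihoods = {}, {}
    for c in classes:
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)      # P(Y=v): fraction of records with Y=v
        likelihoods[c] = Xc.mean(axis=0)  # P(Xi=1|Y=v): fraction of class-v records with pixel i set
    return priors, likelihoods
```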
Naïve Bayes Training - smoothing

• In practice, some of these probabilities/counts can be zero

• Fix this by adding "virtual" counts:

  P(Xi = u | Y = v) = (#records with Xi = u and Y = v + m·p) / (#records with Y = v + m)

• m = number of values the parameter may take
• p = probability of the i-th parameter value (1/m if equiprobable)
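The same fix written as a small helper (a sketch; the counts are assumed to come from the training step above):

```python
def smoothed_likelihood(count_xu_yv, count_yv, m, p=None):
    """Estimate P(Xi = u | Y = v) with m "virtual" counts.

    count_xu_yv: number of records with Xi = u and Y = v
    count_yv:    number of records with Y = v
    m:           number of values Xi may take
    p:           prior probability of value u (defaults to 1/m, i.e. equiprobable)
    """
    if p is None:
        p = 1.0 / m
    return (count_xu_yv + m * p) / (count_yv + m)
```

For a binary pixel (m = 2, p = 1/2) that was never set in class 5, the estimate becomes 1/(N5 + 2) rather than zero, where N5 is the number of class-5 training records.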
Smoothing

Color Images NB Training

• For binary digits, how many pixel values are there?
• For color images, training amounts to either
  • finding the probability of each pixel being R, G, or B for each class, or
  • finding normal-distribution means and standard deviations for each of the R, G, B values for each class
Naïve Bayes Classification
Another Example of the Naïve Bayes Classifier
The weather data, with counts and probabilities:

            outlook             temperature           humidity              windy              play
            yes   no            yes   no              yes   no              yes   no           yes    no
  sunny      2     3    hot      2     2    high       3     4     false     6     2            9      5
  overcast   4     0    mild     4     2    normal     6     1     true      3     3
  rainy      3     2    cool     3     1

  sunny     2/9   3/5   hot     2/9   2/5   high      3/9   4/5    false    6/9   2/5          9/14   5/14
  overcast  4/9   0/5   mild    4/9   2/5   normal    6/9   1/5    true     3/9   3/5
  rainy     3/9   2/5   cool    3/9   1/5

A new day:

  outlook   temperature   humidity   windy   play
  sunny     cool          high       true    ?

• Likelihood of Yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053

• Likelihood of No = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206

• Therefore, the prediction is No
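The same hand computation as a short sketch, with the conditional probabilities hard-coded from the table above:

```python
# P(value | play) read off the weather table, for the attributes of the new day.
p_yes = {"sunny": 2/9, "cool": 3/9, "high": 3/9, "true": 3/9}
p_no  = {"sunny": 3/5, "cool": 1/5, "high": 4/5, "true": 3/5}
prior_yes, prior_no = 9/14, 5/14

new_day = ["sunny", "cool", "high", "true"]

like_yes, like_no = prior_yes, prior_no
for v in new_day:
    like_yes *= p_yes[v]
    like_no *= p_no[v]

print(like_yes, like_no)                                      # ~0.0053 vs ~0.0206
print("prediction:", "yes" if like_yes > like_no else "no")   # no
```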


The Naïve Bayes Classifier for Data Sets with Numerical Attribute Values

• One common practice to handle numerical attribute values is to assume normal
  distributions for the numerical attributes.
The numeric weather data with summary statistics:

            outlook             temperature              humidity                 windy              play
            yes   no            yes     no               yes     no               yes   no           yes    no
  sunny      2     3             83      85               86      85     false     6     2            9      5
  overcast   4     0             70      80               96      90     true      3     3
  rainy      3     2             68      65               80      70
                                 64      72               65      95
                                 69      71               70      91
                                 75                       80
                                 75                       70
                                 72                       90
                                 81                       75

  sunny     2/9   3/5   mean     73      74.6    mean     79.1    86.2   false    6/9   2/5          9/14   5/14
  overcast  4/9   0/5   std dev  6.2     7.9     std dev  10.2    9.7    true     3/9   3/5
  rainy     3/9   2/5
TWO WAYS TO HANDLE CONTINUOUS VALUED ATTRIBUTES
• Discretize the numeric attribute into intervals, or
• Assume a normal (Gaussian) distribution with mean μ and standard deviation σ

Deriving Normal Distribution
• Let x1, x2, …, xn be the values of a numerical attribute in the training data set.
• Estimate μ = (x1 + … + xn)/n and σ² = Σi (xi − μ)²/(n − 1); then the density is

  f(x) = 1/(√(2π)·σ) · exp(−(x − μ)²/(2σ²))
Given a new case: outlook = sunny, temperature = 66, humidity = 90, windy = true. Posterior probabilities?

• For example, f(temperature = 66 | yes) = 1/(√(2π)·6.2) · exp(−(66 − 73)²/(2·6.2²)) ≈ 0.0340

• Likelihood of Yes = 2/9 × f(66 | yes) × f(90 | yes) × 3/9 × 9/14 ≈ 0.000036

• Likelihood of No = 3/5 × f(66 | no) × f(90 | no) × 3/5 × 5/14 ≈ 0.000136

Total Prob of Yes = 0.000036 / (0.000036 + 0.000136) ≈ 20.9%

Total Prob of No = 0.000136 / (0.000036 + 0.000136) ≈ 79.1%
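A sketch of the mixed categorical/Gaussian computation for this case, using the table's means and standard deviations (the printed posteriors come out near 0.21 and 0.79):

```python
import math

def gaussian(x, mu, sigma):
    # Normal density: 1/(sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# New case: outlook = sunny, temperature = 66, humidity = 90, windy = true
like_yes = (2/9) * gaussian(66, 73.0, 6.2) * gaussian(90, 79.1, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * gaussian(66, 74.6, 7.9) * gaussian(90, 86.2, 9.7) * (3/5) * (5/14)

p_yes = like_yes / (like_yes + like_no)   # normalize so the two posteriors sum to 1
print(p_yes, 1 - p_yes)
```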
Outputting Probabilities

• What’s nice about Naïve Bayes (and generative models in general) is that it
returns probabilities
• These probabilities can tell us how confident the algorithm is
• Such a confidence level is not immediately present in DT
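For instance, a tiny sketch of turning the two unnormalized class scores into a reported confidence (illustrative numbers):

```python
def posterior_confidence(score_5, score_6):
    # Normalize the unnormalized posteriors so they sum to 1.
    total = score_5 + score_6
    return {"5": score_5 / total, "6": score_6 / total}

print(posterior_confidence(0.002, 0.006))   # ≈ {'5': 0.25, '6': 0.75}
```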
Comparison of DT and NB

DT:
1. Greedy heuristic
2. Discriminative model; can't generate data
3. Automatic feature prioritization
4. Overfitting - needs pruning / stopped growth
5. Support at leaves
6. No issue with disappearance of values
7. No assumption of independence of features
8. Discretization of continuous values needed
9. Good with lots of data

NB:
1. Statistical
2. Generative model (calculates a probability distribution and can generate data); needs Bayes' theorem to calculate a-posteriors
3. Manual feature selection
4. No need for pruning or post-training tuning
5. Probabilities show confidence level
6. Can suffer vanishing likelihood probabilities - fixed by smoothing
7. The NB independence assumption is there
8. Probability distribution can take care of real values
9. Good with low amounts of data
