Bayesian Classification
• Digit Recognition (5 or 6)
Classifier
• X1,…,Xn ∈ {0,1} (Black vs. White pixels)
• Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier
• We want the posterior probability of the class given its pixels, P(Y | X1,…,Xn); Bayes' rule (written out below) expresses it in terms of a likelihood and a prior
• (for example: what is the probability that the image represents a 5 given its pixels?)
• Why did this help? Well, we think that we might be able to specify how features
are “generated” by the class
[Diagram: each class label (Y1, Y2, Y3) generating points in the feature space X]
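For reference, the rule behind this classifier is just Bayes' theorem applied to the class label (standard result, stated here in LaTeX notation with the slides' X1,…,Xn and Y):

    P(Y \mid X_1, \dots, X_n) = \frac{P(X_1, \dots, X_n \mid Y)\, P(Y)}{P(X_1, \dots, X_n)}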
The Bayes Classifier
• Let’s expand this for our digit recognition task:
• To classify, we’ll simply compute these two probabilities and predict based on which one is
greater
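Concretely, this is the standard Bayes-classifier decision rule (the evidence term P(X1,…,Xn) is shared by both classes, so it can be dropped when comparing):

    \hat{y} = \arg\max_{y \in \{5, 6\}} P(Y = y \mid X_1, \dots, X_n) = \arg\max_{y \in \{5, 6\}} P(X_1, \dots, X_n \mid Y = y)\, P(Y = y)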
Model Parameters
• For the Bayes classifier, we need to “learn” two functions, the likelihood and the
prior
• How many parameters are required to specify the likelihood for our digit
recognition example?
Model Parameters
• How many parameters are required to specify the likelihood?
• (Supposing that each image is 30x30 pixels)
• 2^900 − 1 parameters for each class value: one probability per possible configuration of the 900 binary pixels, minus one because they must sum to 1
Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way
too many parameters:
• We’ll run out of space
• We’ll run out of time
• And we’ll need tons of training data (which is usually not available)
The Naïve Bayes Model
• The Naïve Bayes Assumption: Assume that all features are independent given the
class label Y
• Equationally speaking: P(X1,…,Xn | Y) = P(X1 | Y) · P(X2 | Y) · … · P(Xn | Y)
• Parameters to learn:
▪ K·n likelihoods (2 × 900 = 1800 for the digit example)
▪ K priors
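To make the saving explicit, here is the parameter count for the 30x30 binary-pixel example, comparing the full joint likelihood with the Naïve Bayes factorization (my arithmetic, based on the counts on the slides):

    K\,(2^{n} - 1) = 2\,(2^{900} - 1) \quad \text{vs.} \quad K \cdot n = 2 \times 900 = 1800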
Naïve Bayes Training
• Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data.
Assume BW images:
• Count-based (maximum-likelihood) estimate: P(Xi = 1 | Y = y) = (# class-y training images with pixel i black) / (# class-y training images)
• Smoothed (m-estimate) version: add m·p to the numerator and m to the denominator
• m = number of values the attribute may take
• p = probability of the i-th attribute value (1/m if equiprobable)
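A minimal training sketch in Python under these assumptions (the function name train_nb and the array shapes are mine): images is an (N, 900) array of 0/1 pixels, labels is a length-N array of class labels (5 or 6), and the m-estimate above is applied with m = 2, p = 1/2.

import numpy as np

def train_nb(images, labels, m=2, p=0.5):
    """Estimate the prior and the per-pixel Bernoulli likelihoods for each class."""
    params = {}
    for c in np.unique(labels):
        X_c = images[labels == c]                          # training images of class c
        n_c = len(X_c)
        prior = n_c / len(images)                          # P(Y = c)
        # m-estimate: (count of pixel i being black in class c + m*p) / (n_c + m)
        p_black = (X_c.sum(axis=0) + m * p) / (n_c + m)    # vector of P(X_i = 1 | Y = c)
        params[c] = (prior, p_black)
    return params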
Smoothing
• A pixel value that never occurs with some class in the training data would otherwise get probability 0, and a single zero wipes out the entire product of likelihoods
• The m·p "virtual counts" above keep every estimated likelihood strictly positive (with p = 1/m and m = the number of attribute values this is Laplace smoothing)
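A small worked instance of the smoothed estimate (the counts here are made up for illustration): suppose pixel i is never black in any of 40 training images of a 5; with m = 2 and p = 1/2,

    P(X_i = 1 \mid Y = 5) = \frac{0 + 2 \cdot \tfrac{1}{2}}{40 + 2} = \frac{1}{42} \approx 0.024 \quad \text{instead of } 0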
Color Images NB Training
• For binary digits each pixel takes only two values; how many values can a color (R, G, B) pixel take?
• Training for color images amounts to either
• estimating the probability of each R, G, B value of each pixel, for each class, or
• estimating a normal distribution (mean and standard deviation) for each of the R, G, B values, for each class
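A sketch of the second (normal distribution) option in Python, under assumed conventions (the name fit_gaussian_params and the array shapes are mine): images is an (N, H, W, 3) array of R, G, B values, labels is a length-N array of class labels.

import numpy as np

def fit_gaussian_params(images, labels):
    """Per class, fit a mean and standard deviation to every pixel's R, G, B values."""
    params = {}
    for c in np.unique(labels):
        X_c = images[labels == c].astype(float)             # class-c images, shape (n_c, H, W, 3)
        mu = X_c.mean(axis=0)                               # per-pixel, per-channel mean
        sigma = X_c.std(axis=0, ddof=1) + 1e-6              # sample std dev, small floor to avoid zeros
        params[c] = (len(X_c) / len(images), mu, sigma)     # (prior, means, std devs)
    return params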
Naïve Bayes Classification
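A minimal classification sketch in Python, matching the black/white training sketch above (the name classify_nb is mine); it works in log space because multiplying 900 small probabilities would underflow:

import numpy as np

def classify_nb(image, params):
    """Return the class with the largest posterior (the shared evidence term is omitted)."""
    x = image.ravel()                                       # 0/1 pixel vector of length 900
    best_class, best_score = None, -np.inf
    for c, (prior, p_black) in params.items():
        # log P(Y = c) + sum_i log P(X_i = x_i | Y = c) for a Bernoulli pixel model
        log_lik = np.sum(x * np.log(p_black) + (1 - x) * np.log(1 - p_black))
        score = np.log(prior) + log_lik
        if score > best_score:
            best_class, best_score = c, score
    return best_class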
Another Example of the Naïve Bayes Classifier
The weather data, with counts and probabilities
outlook             temperature               humidity                  windy              play
          yes  no             yes    no                yes    no                yes  no    yes   no
sunny      2    3              83    85                 86    85        false    6    2     9     5
overcast   4    0              70    80                 96    90        true     3    3
rainy      3    2              68    65                 80    70
                               64    72                 65    95
                               69    71                 70    91
                               75                       80
                               75                       70
                               72                       90
                               81                       75

sunny     2/9  3/5   mean      73    74.6     mean    79.1   86.2       false  6/9  2/5   9/14  5/14
overcast  4/9  0/5   std dev  6.2     7.9     std dev 10.2    9.7       true   3/9  3/5
rainy     3/9  2/5

A new day

outlook   temperature   humidity   windy   play
sunny     cool          high       true    ?

• Likelihood of yes = P(sunny | yes) × P(cool | yes) × P(high | yes) × P(true | yes) × P(yes)
• Likelihood of no = P(sunny | no) × P(cool | no) × P(high | no) × P(true | no) × P(no)
TWO WAYS TO HANDLE CONTINUOUS VALUED ATTRIBUTES
• Treat the values as discrete (discretize them) and estimate a probability for each value, per class
• Assume a probability distribution (typically the normal distribution) and estimate its parameters, mean μ and standard deviation σ, per class
Deriving Normal Distribution
• Let x1, x2, …, xn be the values of a numerical attribute in the training data set
• Mean: μ = (x1 + x2 + … + xn) / n
• Standard deviation: σ = √( Σi (xi − μ)² / (n − 1) )
• Density used as the likelihood of a value x: f(x) = (1 / (σ √(2π))) · exp( −(x − μ)² / (2σ²) )
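As a check against the weather table above, plugging the nine "yes" temperature values (83, 70, 68, 64, 69, 75, 75, 72, 81) into these formulas gives:

    \mu = \frac{83 + 70 + 68 + 64 + 69 + 75 + 75 + 72 + 81}{9} = 73, \qquad \sigma = \sqrt{\frac{(83 - 73)^2 + \dots + (81 - 73)^2}{8}} \approx 6.2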
Given a new case: outlook = sunny, temperature = 66, humidity = 80, windy = true. Posterior probabilities?
• For example, using the probabilities from the table and the normal density f from the previous slide:
• Likelihood of Yes = P(sunny | yes) × f(temperature = 66 | yes) × f(humidity = 80 | yes) × P(windy = true | yes) × P(yes)
• Likelihood of No = P(sunny | no) × f(temperature = 66 | no) × f(humidity = 80 | no) × P(windy = true | no) × P(no)
• What’s nice about Naïve Bayes (and generative models in general) is that it
returns probabilities
• These probabilities can tell us how confident the algorithm is
• Such a confidence level is not immediately available from a decision tree (DT)
Comparison of DT and NB
DT:
1. Greedy heuristic
2. Discriminative model; cannot generate data
3. Automatic feature prioritization
4. Overfitting: needs pruning / stopped growth
5. Support at the leaves
6. No issue with disappearance of values
7. No assumption of independence between features
8. Discretization of continuous values needed
9. Good with lots of data

NB:
1. Statistical
2. Generative model (estimates a probability distribution and can generate data); needs Bayes' theorem to calculate the posteriors
3. Manual feature selection
4. No need for pruning or post-training tuning
5. Probabilities show a confidence level
6. Can suffer from vanishing likelihood probabilities; needs smoothing
7. The NB independence assumption is required
8. A probability distribution can handle real-valued attributes
9. Good with small amounts of data