Lec 09
Naïve Bayes
Spring 2023
[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Machine Learning
[Diagram: x (input) → feature extraction → features (attributes of x) → machine learning → y (predicted output)]
Training and Machine Learning
Big idea: ML algorithms learn patterns between features and labels from data
You don’t have to reason about the data yourself
You’re given training data: lots of example datapoints and their actual labels
Training: Learn patterns from labeled data, and periodically test how well you're doing
Eventually, use your algorithm to predict labels for unlabeled data
Example: Spam Filter
Input: an email
Output: spam/ham
Setup:
Get a large collection of example emails, each labeled “spam” or “ham”
Note: someone has to hand label all this data!
Want to learn to predict labels of new, future emails
Features: The attributes used to make the ham / spam decision
Words: FREE!
Text Patterns: $dd, CAPS
Non-text: SenderInContacts, WidelyBroadcast
…
Example emails from the collection:
“Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
“TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99”
“Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”
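To make the feature list above concrete, here is a minimal Python sketch (my own illustration, not from the slides) that extracts a few word, text-pattern, and non-text features from a raw email; the function name and the contacts argument are assumptions for the example.

import re

def extract_features(email_text, sender, contacts):
    """Toy feature extractor for a spam/ham classifier (illustrative only)."""
    words = email_text.lower().split()
    return {
        # Word feature: does the word "free" appear?
        "contains_free": "free" in words or "free!" in words,
        # Text-pattern features: dollar amounts like $99, and runs of ALL CAPS
        "has_dollar_amount": bool(re.search(r"\$\d+", email_text)),
        "has_all_caps_run": bool(re.search(r"\b[A-Z]{4,}\b", email_text)),
        # Non-text feature: is the sender already in the user's contacts?
        "sender_in_contacts": sender in contacts,
    }

features = extract_features("99 MILLION EMAIL ADDRESSES FOR ONLY $99",
                            sender="unknown@example.com", contacts=set())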
Example: Digit Recognition
[Example digit images, labeled 0, 1, 2, 1, and one unlabeled image marked ??]
Input: images / pixel grids
Output: a digit 0-9
Setup:
Get a large collection of example images, each labeled with a digit
Note: someone has to hand label all this data!
Want to learn to predict labels of new, future digit images
Features: The attributes used to make the digit decision
Pixels: (6,8)=ON
Shape Patterns: NumComponents, AspectRatio, NumLoops
…
Features are increasingly induced rather than crafted
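A minimal sketch (my own illustration, not course code) of the “Pixels: (6,8)=ON” style of feature: every grid position becomes one binary feature, using an assumed intensity cutoff of 0.5.

def pixel_features(image):
    """image: 2D list of grayscale intensities in [0, 1].
    Returns a dict mapping (row, col) -> True/False, i.e. pixel ON/OFF."""
    threshold = 0.5  # assumed cutoff; the slides only say on/off
    return {(i, j): intensity > threshold
            for i, row in enumerate(image)
            for j, intensity in enumerate(row)}

# e.g. pixel_features(img)[(6, 8)] == True corresponds to the feature "(6,8)=ON"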
Other Classification Tasks
Classification: given inputs x, predict labels (classes) y
Examples:
Medical diagnosis (input: symptoms, classes: diseases)
Fraud detection (input: account activity, classes: fraud / no fraud)
Automatic essay grading (input: document, classes: grades)
Customer service email routing
Review sentiment
Language ID
… many more
Challenges
What structure should the BN have?
How should we learn its parameters?
Naïve Bayes Model
[Bayes' net diagram: label node Y with feature children F1, F2, …, Fn]
Random variables in this Bayes' net:
Y = The label
F1, F2, …, Fn = The n features
Probability tables in this Bayes' net:
P(Y) = Probability of each label occurring, given no information about the features. Sometimes called the prior.
P(Fi|Y) = One table per feature. Probability distribution over a feature, given the label.
Naïve Bayes Model
[Same Bayes' net diagram: Y with children F1, F2, …, Fn]
To perform training:
Use the training dataset to estimate the probability tables.
Estimate P(Y) = how often does each label occur?
Estimate P(Fi|Y) = how does the label affect the feature?
To perform classification:
Instantiate all features. You know the input features, so they're your evidence.
Query for P(Y|f1, f2, …, fn). Probability of label, given all the input features.
Use an inference algorithm (e.g. variable elimination) to compute this.
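To make the train/classify recipe concrete, here is a compact Python sketch (my own illustration of the steps above, not code from the course). Training just counts to build the P(Y) and P(Fi|Y) tables; classification multiplies them and normalizes. It assumes discrete feature values and ignores smoothing of unseen values (discussed later).

from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_dict, label) pairs. Returns the P(Y) and P(Fi|Y) tables."""
    label_counts = Counter(label for _, label in examples)
    prior = {y: n / len(examples) for y, n in label_counts.items()}

    value_counts = defaultdict(Counter)          # (feature, label) -> Counter over values
    for feats, y in examples:
        for f, v in feats.items():
            value_counts[(f, y)][v] += 1
    cond_prob = {key: {v: n / sum(c.values()) for v, n in c.items()}
                 for key, c in value_counts.items()}
    return prior, cond_prob

def classify(feats, prior, cond_prob):
    """Return P(Y | f1, ..., fn): multiply prior by each feature likelihood, then normalize."""
    scores = {}
    for y, p_y in prior.items():
        score = p_y
        for f, v in feats.items():
            score *= cond_prob.get((f, y), {}).get(v, 0.0)
        scores[y] = score
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()} if total > 0 else scores

In the spam example, examples would be the labeled training emails (feature dicts plus "spam"/"ham") and feats the features extracted from a new email.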
Example: Naïve Bayes for Spam Filter
Step 1: Select a ML algorithm. We choose to model the problem with Naïve Bayes.
Step 2: Choose features to use.
Y: The label (spam or ham)

Y      P(Y)
ham    ?
spam   ?

Row 4: P(F2=0 | Y=spam) = 0.25 because 1 out of 4 spam emails contains “free” 0 times.
Row 5: P(F2=1 | Y=spam) = 0.50 because 2 out of 4 spam emails contain “free” 1 time.
Row 6: P(F2=2 | Y=spam) = 0.25 because 1 out of 4 spam emails contains “free” 2 times.
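Those rows come straight from counting. A tiny sketch (illustrative only; the four per-email counts of “free” are the ones implied by the rows above) of how one conditional table is estimated:

from collections import Counter

free_counts_in_spam = [0, 1, 1, 2]   # times "free" appears in each of the 4 training spam emails

counts = Counter(free_counts_in_spam)
p_f2_given_spam = {value: n / len(free_counts_in_spam) for value, n in counts.items()}
# p_f2_given_spam == {0: 0.25, 1: 0.5, 2: 0.25}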
Example: Naïve Bayes for Spam Filter
Model trained on a larger dataset:
Y: The label (spam or ham)

Y      P(Y)
ham    0.6
spam   0.4

Or, if you don’t need probabilities, note that 0.0294 > 0.0048 and guess ham.
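For reference (my own arithmetic, using the two unnormalized scores quoted above), normalizing gives the actual posterior: P(ham | f1, …, fn) = 0.0294 / (0.0294 + 0.0048) ≈ 0.86 and P(spam | f1, …, fn) ≈ 0.14, so ham is the predicted label either way.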
Naïve Bayes for Digits
[Bayes' net diagram: Y with children F1, F2, …, Fn; the P(Y) table alone has |Y| parameters]
Simple digit recognition version:
One feature (variable) Fij for each grid position <i,j>
Feature values are on / off, based on whether the pixel's intensity is above a threshold
Here: lots of features, each is binary valued
Inference works as before: Step 1: get the joint probability of each label with the evidence. Step 2: sum to get probability of evidence. Step 3: divide to get P(Y | evidence).
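With hundreds of binary pixel features, the product of many small likelihoods underflows floating point, so implementations typically score labels in log space. A brief sketch (a standard practical note of mine, not something stated on the slides):

import math

def log_score(label, pixel_feats, prior, cond_prob):
    """Unnormalized log P(Y=label, evidence): log prior plus a sum of log likelihoods.
    Assumes no zero probabilities (see Laplace smoothing later)."""
    score = math.log(prior[label])
    for f, v in pixel_feats.items():
        score += math.log(cond_prob[(f, label)][v])
    return score

# The label with the highest log score is also the label with the highest posterior.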
Data: a collection of example emails, each labeled spam or ham
Note: someone has to hand label all this data!
Split into training, held-out, test sets
Classifiers
Learn on the training set
(Tune it on a held-out set)
Test it on new emails
Naïve Bayes for Text
Bag-of-words Naïve Bayes:
Features: Wi is the word at position i
As before: predict label conditioned on feature variables (spam vs. ham)
As before: assume features are conditionally independent given label
New: each Wi is identically distributed (Wi is the word at position i, not the ith word in the dictionary!)
Generative model: P(Y, W1, …, Wn) = P(Y) ∏i P(Wi|Y)
Example result for a spam email: P(spam | w) = 98.9%
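A small sketch (my own, assuming the word tables have already been estimated) of what the bag-of-words assumption buys us: every position shares one P(W|Y) table, so scoring an email only needs word counts.

import math
from collections import Counter

def log_scores(words, prior, word_prob):
    """words: list of tokens in the email.
    prior: {label: P(label)}; word_prob: {label: {word: P(word | label)}}.
    Returns unnormalized log scores per label; one shared P(W|Y) table covers every position."""
    counts = Counter(words)
    scores = {}
    for y in prior:
        score = math.log(prior[y])
        for w, n in counts.items():
            score += n * math.log(word_prob[y].get(w, 1e-9))  # tiny floor stands in for real smoothing
        scores[y] = score
    return scores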
General Naïve Bayes
What do we need in order to use Naïve Bayes?
Inference method
Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
Use standard inference to compute P(Y|F1…Fn)
Nothing new here
[Plot: a degree-15 polynomial fit to data points, illustrating overfitting]
Example: Overfitting
[Worked posterior calculation for an example input omitted; the slide's conclusion is "2 wins!!"]
Example: Overfitting
Posteriors are determined by relative probabilities (odds ratios). [Example odds-ratio tables omitted]
As an extreme case, imagine using the entire email as the only feature (e.g. document ID)
Would get the training data perfect (if deterministic labeling)
Wouldn’t generalize at all
Just making the bag-of-words assumption gives us some generalization, but isn’t enough
Parameter Estimation
Estimating the distribution of a random variable from training samples (e.g., draws of colored balls: r r b): use the relative frequency of each outcome.
Another option is to consider the most likely parameter value given the data (the maximum likelihood estimate), which for this model is again the relative frequency.
Unseen Events
If an outcome never appears in the training sample, its relative-frequency estimate is 0, and the classifier will rule it out entirely.
Laplace Smoothing
Laplace's estimate: pretend you saw every outcome once more than you actually did.
Example sample: r r b
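As a worked instance of that rule (my own arithmetic, writing P_LAP for the smoothed estimate and using the r r b sample above, with two possible colors): every count gets +1, and the denominator grows by the number of possible outcomes.
P_LAP(r) = (2 + 1) / (3 + 2) = 3/5
P_LAP(b) = (1 + 1) / (3 + 2) = 2/5
Compare the unsmoothed estimates P(r) = 2/3, P(b) = 1/3.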
Calibration
Weak calibration: higher confidences mean higher accuracy
Strong calibration: confidence predicts accuracy rate
What’s the value of calibration?
Summary
Bayes rule lets us do diagnostic queries with causal probabilities
The naïve Bayes assumption takes all features to be independent given the class label
We can build classifiers out of a naïve Bayes model using training data