

Machine Learning and Data Mining

Bayes Classifiers

Prof. Alexander Ihler


A basic classifier
•  Training data D={x(i),y(i)}, Classifier f(x ; D)
–  Discrete feature vector x
–  f(x ; D) is a contingency table
•  Ex: credit rating prediction (bad/good)
–  X1 = income (low/med/high)
–  How can we make the largest number of correct predictions?

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5



A basic classifier
•  Training data D={x(i),y(i)}, Classifier f(x ; D)
–  Discrete feature vector x
–  f(x ; D) is a contingency table
•  Ex: credit rating prediction (bad/good)
–  X1 = income (low/med/high)
–  How can we make the largest number of correct predictions?
–  Predict the more likely outcome for each possible observation

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5



A basic classifier
•  Training data D={x(i),y(i)}, Classifier f(x ; D)
–  Discrete feature vector x
–  f(x ; D) is a contingency table
•  Ex: credit rating prediction (bad/good)
–  X1 = income (low/med/high)
–  How can we make the largest number of correct predictions?
–  Predict the more likely outcome for each possible observation
–  Can normalize into a probability: p( y=good | X=c )  (see the sketch below)
–  How to generalize?

Features   p( y=bad | X )   p( y=good | X )
X=0        .7368            .2632
X=1        .5408            .4592
X=2        .3750            .6250

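To make the table concrete, here is a small sketch in Python/numpy (not part of the original slides; the counts are the ones shown above) that normalizes the counts and predicts the more likely class for each income level:

import numpy as np

# Counts from the credit-rating example above: rows are X=0,1,2 (low/med/high income),
# columns are the classes (bad, good).
counts = np.array([[ 42,  15],
                   [338, 287],
                   [  3,   5]], dtype=float)

# Normalize each row to get p(y | X=c); predict the more likely class for each row.
p_y_given_x = counts / counts.sum(axis=1, keepdims=True)
prediction = p_y_given_x.argmax(axis=1)        # 0 = bad, 1 = good

print(np.round(p_y_given_x, 4))   # [[0.7368 0.2632] [0.5408 0.4592] [0.375 0.625]]
print(prediction)                 # [0 0 1] -> predict "bad" for X=0,1 and "good" for X=2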

Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2

•  You wake up with a headache – what is the chance that you have the flu?

(Example from Andrew Moore’s slides)
Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2
•  P(H & F) = ?
•  P(F|H) = ?

(Example from Andrew Moore’s slides)
Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2
•  P(H & F) = p(F) p(H|F) = (1/2) * (1/40) = 1/80
•  P(F|H) = ?

(Example from Andrew Moore’s slides)
Bayes rule
•  Two events: headache, flu
•  p(H) = 1/10
•  p(F) = 1/40
•  p(H|F) = 1/2
•  P(H & F) = p(F) p(H|F) = (1/2) * (1/40) = 1/80
•  P(F|H) = p(H & F) / p(H) = (1/80) / (1/10) = 1/8

(Example from Andrew Moore’s slides)
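For reference, the calculation above is just Bayes’ rule; written out (the general expression appears on the slide only as an image):

\[
p(F \mid H) \;=\; \frac{p(H \mid F)\, p(F)}{p(H)} \;=\; \frac{(1/2)\,(1/40)}{1/10} \;=\; \frac{1}{8}.
\]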
Classification and probability
•  Suppose we want to model the data
•  Prior probability of each class, p(y)
–  E.g., fraction of applicants that have good credit
•  Distribution of features given the class, p(x | y=c)
–  How likely are we to see “x” in users with good credit?
•  Joint distribution
•  Bayes rule (written out below)
(Use the rule of total probability to calculate the denominator!)
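The joint distribution and Bayes rule referenced above appear on the slide only as images; in standard form:

\[
p(x, y) = p(y)\, p(x \mid y), \qquad
p(y = c \mid x) = \frac{p(x \mid y = c)\, p(y = c)}{p(x)}, \qquad
p(x) = \sum_{c'} p(x \mid y = c')\, p(y = c').
\]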
Bayes classifiers
•  Learn “class conditional” models
–  Estimate a probability model for each class
•  Training data
–  Split by class: Dc = { x(j) : y(j) = c }
•  Estimate p(x | y=c) using Dc
•  For a discrete x, this recalculates the same table… (see the sketch below)

Features   # bad   # good   p(x | y=0)   p(x | y=1)   p(y=0|x)   p(y=1|x)
X=0        42      15       42 / 383     15 / 307     .7368      .2632
X=1        338     287      338 / 383    287 / 307    .5408      .4592
X=2        3       5        3 / 383      5 / 307      .3750      .6250
p(y)                        383 / 690    307 / 690
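A sketch in Python/numpy (not from the slides; it uses the counts above) of the class-conditional route: estimate p(y) and p(x | y=c), then recover the same posteriors with Bayes rule:

import numpy as np

# Rows are X=0,1,2; columns are classes (y=0 "bad", y=1 "good").
counts = np.array([[ 42,  15],
                   [338, 287],
                   [  3,   5]], dtype=float)

p_y = counts.sum(axis=0) / counts.sum()        # class priors: [383/690, 307/690]
p_x_given_y = counts / counts.sum(axis=0)      # class conditionals, one column per class

# Bayes rule: p(y|x) is proportional to p(x|y) p(y), normalized over the classes
joint = p_x_given_y * p_y
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

print(np.round(p_y_given_x, 4))   # matches the table: [[.7368 .2632] [.5408 .4592] [.375 .625]]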


Bayes classifiers
•  Learn “class conditional” models
–  Estimate a probability model for each class
•  Training data
–  Split by class: Dc = { x(j) : y(j) = c }
•  Estimate p(x | y=c) using Dc
•  For continuous x, can use any density estimate we like
–  Histogram
–  Gaussian
–  …

[Figure: histogram density estimate of a continuous feature]
Gaussian models
•  Estimate parameters of the Gaussians from the data

[Figure: Gaussian class-conditional densities fit along feature x1]
Multivariate Gaussian models
•  Similar to the univariate case
–  µ = length-d column vector
–  Σ = d x d covariance matrix
–  |Σ| = matrix determinant
•  Maximum likelihood estimate: see the expressions below

[Figure: contours of a 2-D Gaussian fit to scattered data]
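The density and maximum-likelihood estimates referred to above are shown on the slide only as images; the standard expressions are:

\[
p(x \mid y = c) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big),
\qquad
\hat{\mu} = \frac{1}{m}\sum_{j=1}^{m} x^{(j)},
\qquad
\hat{\Sigma} = \frac{1}{m}\sum_{j=1}^{m} \big(x^{(j)}-\hat{\mu}\big)\big(x^{(j)}-\hat{\mu}\big)^{T}.
\]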
Example: Gaussian Bayes for Iris Data
•  Fit a Gaussian distribution to each class {0,1,2}

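A minimal sketch of the per-class fit in Python/numpy (not the course code; it assumes scikit-learn is available only as a convenient way to load the Iris data):

import numpy as np
from sklearn.datasets import load_iris     # assumption: used here just as a data source

X, y = load_iris(return_X_y=True)          # 150 x 4 features, labels in {0,1,2}

params = {}
for c in np.unique(y):
    Xc = X[y == c]                                   # split the training data by class
    mu = Xc.mean(axis=0)                             # ML estimate of the mean
    Sigma = np.cov(Xc, rowvar=False, bias=True)      # ML estimate of the covariance (divide by m)
    params[c] = (mu, Sigma)
    print(c, np.round(mu, 2))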



Machine Learning and Data Mining

Bayes Classifiers: Naïve Bayes

Prof. Alexander Ihler


Bayes classifiers
•  Estimate p(y) = [ p(y=0) , p(y=1) … ]
•  Estimate p(x | y=c) for each class c
•  Calculate p(y=c | x) using Bayes rule
•  Choose the most likely class c
•  For a discrete x, can represent as a contingency table…
–  What about if we have more discrete features?

Features   # bad   # good   p(x | y=0)   p(x | y=1)   p(y=0|x)   p(y=1|x)
X=0        42      15       42 / 383     15 / 307     .7368      .2632
X=1        338     287      338 / 383    287 / 307    .5408      .4592
X=2        3       5        3 / 383      5 / 307      .3750      .6250
p(y)                        383 / 690    307 / 690


Joint distributions
•  Make a truth table of all combinations of values

A B C
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1
Joint distributions
•  Make a truth table of all combinations of values
•  For each combination of values, determine how probable it is
•  Total probability must sum to one
•  How many values did we specify?

A B C   p(A,B,C | y=1)
0 0 0   0.50
0 0 1   0.05
0 1 0   0.01
0 1 1   0.10
1 0 0   0.04
1 0 1   0.15
1 1 0   0.05
1 1 1   0.10

Overfitting and density estimation
•  Estimate probabilities from the data
–  E.g., how many times (what fraction) did each outcome occur?
•  M data << 2^N parameters?
•  What about the zeros?
–  We learn that certain combinations are impossible?
–  What if we see these later in test data?
•  Overfitting!

A B C   p(A,B,C | y=1)
0 0 0   4/10
0 0 1   1/10
0 1 0   0/10
0 1 1   0/10
1 0 0   1/10
1 0 1   2/10
1 1 0   1/10
1 1 1   1/10
Overfitting and density estimation
•  Estimate probabilities from the data
–  E.g., how many times (what fraction) did each outcome occur?
•  M data << 2^N parameters?
•  What about the zeros?
–  We learn that certain combinations are impossible?
–  What if we see these later in test data?
•  One option: regularize (see the sketch below)
•  Normalize to make sure values sum to one…

A B C   p(A,B,C | y=1)
0 0 0   4/10
0 0 1   1/10
0 1 0   0/10
0 1 1   0/10
1 0 0   1/10
1 0 1   2/10
1 1 0   1/10
1 1 1   1/10
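The slide does not spell out the regularizer; a common choice is to add a small pseudo-count to every outcome before normalizing (add-alpha / Laplace smoothing). A sketch in Python/numpy, using the counts above:

import numpy as np

counts = np.array([4, 1, 0, 0, 1, 2, 1, 1], dtype=float)   # outcomes 000..111, 10 samples total

alpha = 1.0                                                 # pseudo-count; alpha=1 is "add-one" smoothing
p_smoothed = (counts + alpha) / (counts.sum() + alpha * counts.size)

print(np.round(p_smoothed, 3))    # no zero probabilities, and the values still sum to one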
Overfitting and density estimation
•  Another option: reduce the model complexity
–  E.g., assume that features are independent of one another
•  Independence: p(a,b) = p(a) p(b)
•  p(x1, x2, … xN | y=1) = p(x1 | y=1) p(x2 | y=1) … p(xN | y=1)
•  Only need to estimate each factor individually (see the sketch below)

A  p(A|y=1)      B  p(B|y=1)      C  p(C|y=1)
0  .4            0  .7            0  .1
1  .6            1  .3            1  .9

A B C   p(A,B,C | y=1)
0 0 0   .4 * .7 * .1
0 0 1   .4 * .7 * .9
0 1 0   .4 * .3 * .1
0 1 1   …
1 0 0   …
1 0 1   …
1 1 0   …
1 1 1   …
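A sketch in Python (not from the slides) that rebuilds the joint table from the three single-feature tables above, using the independence assumption:

import numpy as np
from itertools import product

# Per-feature conditionals p(A|y=1), p(B|y=1), p(C|y=1) from the tables above (index = value 0/1)
pA = np.array([0.4, 0.6])
pB = np.array([0.7, 0.3])
pC = np.array([0.1, 0.9])

# Naive Bayes: the joint over (A,B,C) given y=1 is just the product of the factors
for a, b, c in product([0, 1], repeat=3):
    print(a, b, c, round(pA[a] * pB[b] * pC[c], 3))
# e.g. (0,0,0) -> 0.4 * 0.7 * 0.1 = 0.028; the 8 values sum to 1 but need only 3 free parameters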
Example: Naïve Bayes
Observed Data:

x1 x2 y
1  1  0
1  0  0
1  0  1
0  0  0
0  1  1
1  1  0
0  0  1
1  0  1

Prediction given some observation x?
Compare p(y=0) p(x | y=0) vs. p(y=1) p(x | y=1)  (worked through in the sketch below):
Decide class 0
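A worked version of this example in Python/numpy (not from the slides). The query point is not visible in the extracted text; x = (1, 1) is an illustrative choice that reproduces the "decide class 0" conclusion:

import numpy as np

# Observed data from the slide: columns are x1, x2, y
D = np.array([[1,1,0],[1,0,0],[1,0,1],[0,0,0],
              [0,1,1],[1,1,0],[0,0,1],[1,0,1]])
X, y = D[:, :2], D[:, 2]

def nb_scores(x_new):
    """Naive Bayes scores p(y=c) * prod_i p(x_i = x_new_i | y=c) for c = 0, 1."""
    scores = []
    for c in (0, 1):
        Xc = X[y == c]
        prior = Xc.shape[0] / X.shape[0]
        cond = [np.mean(Xc[:, i] == x_new[i]) for i in range(X.shape[1])]
        scores.append(prior * np.prod(cond))
    return scores

print(nb_scores([1, 1]))   # [0.1875, 0.0625] -> class 0 wins, matching "decide class 0"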
Example: Naïve Bayes
Observed Data:

x1 x2 y
1  1  0
1  0  0
1  0  1
0  0  0
0  1  1
1  1  0
0  0  1
1  0  1


Example: Joint Bayes
Observed Data:

x1 x2 y
1  1  0
1  0  0
1  0  1
0  0  0
0  1  1
1  1  0
0  0  1
1  0  1

x1 x2   p(x | y=0)        x1 x2   p(x | y=1)
0  0    1/4               0  0    1/4
0  1    0/4               0  1    1/4
1  0    1/4               1  0    2/4
1  1    2/4               1  1    0/4



Naïve Bayes models
•  Variable y to predict, e.g. “auto accident in next year?”
•  We have *many* co-observed vars x=[x1…xn]
–  Age, income, education, zip code, …
•  Want to learn p(y | x1…xn), to predict y
–  Arbitrary distribution: O(d^n) values!
•  Naïve Bayes:
–  p(y|x) = p(x|y) p(y) / p(x) ;   p(x|y) = ∏i p(xi|y)
–  Covariates are independent given the “cause”
•  Note: may not be a good model of the data
–  Doesn’t capture correlations in x’s
–  Can’t capture some dependencies
•  But in practice it often does quite well!
Naïve Bayes models for spam
•  y ∈ {spam, not spam}
•  X = observed words in email
–  Ex: [“the” … “probabilistic” … “lottery” …]
–  “1” if word appears; “0” if not
•  1000’s of possible words: 2^1000’s of parameters?
•  # of atoms in the universe: ≈ 2^270…
•  Model words given email type as independent
•  Some words more likely for spam (“lottery”)
•  Some more likely for real (“probabilistic”)
•  Only 1000’s of parameters now… (see the sketch below)
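A minimal Bernoulli naive-Bayes sketch in Python/numpy for the spam setting (illustrative only; the function names, the add-alpha smoothing, and the data layout are assumptions, not the course's code):

import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """X: (N, V) binary word-presence matrix; y: (N,) labels in {0, 1}, 1 = spam.
    Returns log-priors and log p(word present/absent | class), with add-alpha smoothing."""
    log_prior = np.log(np.bincount(y, minlength=2) / len(y))
    theta = np.vstack([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                       for c in (0, 1)])              # (2, V) per-class word probabilities
    return log_prior, np.log(theta), np.log1p(-theta)

def predict_spam(x, log_prior, log_theta, log_1m_theta):
    # Independence assumption: sum per-word log-likelihoods, whether each word is present or absent
    ll = log_prior + log_theta @ x + log_1m_theta @ (1 - x)
    return int(np.argmax(ll))

With a vocabulary of size V this needs on the order of 2V + 1 parameters rather than 2^V, which is the point of the slide.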
Naïve Bayes Gaussian models
•  Model each feature with its own independent Gaussian, i.e. a diagonal covariance:

    Σ = [ σ²₁₁    0   ]
        [  0    σ²₂₂  ]

•  Again, reduces the number of parameters of the model:
–  Bayes: n²/2 covariance parameters;  Naïve Bayes: n

[Figure: axis-aligned Gaussian contours over features x1, x2, with σ²₁₁ > σ²₂₂]
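Written out (a reconstruction; the slide shows only the picture and the covariance matrix), the naïve Bayes Gaussian model factors each class-conditional across features:

\[
p(x \mid y = c) \;=\; \prod_{i=1}^{n} \mathcal{N}\!\big(x_i \,;\, \mu_{c,i},\, \sigma_{c,i}^{2}\big),
\]

so each class needs only n means and n variances instead of a full n x n covariance matrix.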
You should know…
•  Bayes rule; p( y | x )
•  Bayes classifiers
–  Learn p( x | y=C ) , p( y=C )
•  Naïve Bayes classifiers
–  Assume features are independent given class:
p( x | y=C ) = p( x1 | y=C ) p( x2 | y=C ) …

•  Maximum likelihood (empirical) estimators for


–  Discrete variables
–  Gaussian variables
–  Overfitting; simplifying assumptions or regularization

Machine Learning and Data Mining

Bayes Classifiers: Measuring Error

Prof. Alexander Ihler


A Bayes classifier
•  Given training data, compute p( y=c | x ) and choose the largest
•  What’s the (training) error rate of this method?

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5



A Bayes classifier
•  Given training data, compute p( y=c | x ) and choose the largest
•  What’s the (training) error rate of this method?

Features   # bad   # good
X=0        42      15
X=1        338     287
X=2        3       5

Gets these examples wrong: Pr[ error ] = (15 + 287 + 3) / 690 ≈ 0.44
(empirically, on training data; better to use test data)



Bayes Error Rate
•  Suppose that we knew the true probabilities:
–  Observe any x: we know p( y | x ) at that x
–  Optimal decision at that particular x: choose the most probable class
–  Error rate at that x: the probability of the other class(es)
= “Bayes error rate”  (see the expressions below)
•  This is the best that any classifier can do!
•  Measures the fundamental hardness of separating y-values given only the features x

•  Note: conceptual only!
–  Probabilities p(x,y) must be estimated from data
–  Form of p(x,y) is not known and may be very complex
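The expressions referenced above are images on the slide; a standard way to write them is:

\[
\hat{y}(x) \;=\; \arg\max_{c}\, p(y = c \mid x),
\qquad
\text{Bayes error rate} \;=\; \mathbb{E}_{x}\Big[\, 1 - \max_{c}\, p(y = c \mid x) \,\Big].
\]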
A Bayes classifier
•  Bayes classification decision rule compares probabilities:
   choose ŷ = 1 where p(x, y=1) > p(x, y=0), and ŷ = 0 otherwise
•  Can visualize this nicely if x is a scalar:

[Figure: the two joint densities p(x, y=0) and p(x, y=1) plotted along feature x1;
 each curve has shape p(x | y=c) and area p(y=c); the decision boundary is where
 the curves cross]
A Bayes classifier
•  Add a multiplier alpha to the decision rule:
   choose ŷ = 1 where p(x, y=1) > α · p(x, y=0)  (larger α favors class 0)
•  Not all errors are created equally…
•  Risk associated with each outcome?

[Figure: p(x, y=0) and p(x, y=1) with the shifted decision boundary;
 Type 1 errors: false positives; Type 2 errors: false negatives]

False positive rate: (# y=0, ŷ=1) / (# y=0)
False negative rate: (# y=1, ŷ=0) / (# y=1)
A Bayes classifier
•  With the multiplier alpha: choose ŷ = 1 where p(x, y=1) > α · p(x, y=0)
•  Increase alpha: prefer class 0
•  Example: spam detection (avoid discarding real email)

[Figure: decision boundary shifted relative to the α = 1 case;
 Type 1 errors: false positives; Type 2 errors: false negatives]

False positive rate: (# y=0, ŷ=1) / (# y=0)
False negative rate: (# y=1, ŷ=0) / (# y=1)
A Bayes classifier
•  With the multiplier alpha: choose ŷ = 1 where p(x, y=1) > α · p(x, y=0)
•  Decrease alpha: prefer class 1
•  Example: cancer detection (a missed positive is far more costly than a false alarm)

[Figure: decision boundary shifted relative to the α = 1 case;
 Type 1 errors: false positives; Type 2 errors: false negatives]

False positive rate: (# y=0, ŷ=1) / (# y=0)
False negative rate: (# y=1, ŷ=0) / (# y=1)
(see the sketch below)
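A small sketch in Python/numpy of the thresholded decision and the two error rates defined above (illustrative; the function names and the convention that larger alpha favors class 0 are assumptions consistent with the slides):

import numpy as np

def decide(p_x_y0, p_x_y1, alpha=1.0):
    """Decide yhat = 1 where p(x, y=1) > alpha * p(x, y=0); larger alpha favors class 0."""
    return (np.asarray(p_x_y1) > alpha * np.asarray(p_x_y0)).astype(int)

def error_rates(y, yhat):
    y, yhat = np.asarray(y), np.asarray(yhat)
    fpr = np.mean(yhat[y == 0] == 1)    # false positive rate: (# y=0, yhat=1) / (# y=0)
    fnr = np.mean(yhat[y == 1] == 0)    # false negative rate: (# y=1, yhat=0) / (# y=1)
    return fpr, fnr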
Measuring errors
•  Confusion matrix (can extend to more classes):

         Predict 0   Predict 1
Y=0      380         5
Y=1      338         3

•  True positive rate: #(y=1, ŷ=1) / #(y=1) -- “sensitivity”
•  False negative rate: #(y=1, ŷ=0) / #(y=1)
•  False positive rate: #(y=0, ŷ=1) / #(y=0)
•  True negative rate: #(y=0, ŷ=0) / #(y=0) -- “specificity”
Likelihood ratio tests
•  Connection to classical, statistical decision theory: the “log likelihood ratio”
(written out below)
•  Likelihood ratio: relative support for observation “x” under the
“alternative hypothesis” y=1, compared to the “null hypothesis” y=0
•  Can vary the decision threshold gamma
•  Classical testing:
–  Choose gamma so that the FPR is fixed (“p-value”)
–  Given that y=0 is true, what’s the probability we decide y=1?
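The test statistic itself is an image on the slide; the standard form of the comparison is:

\[
\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}
\;\; \underset{\hat{y}=0}{\overset{\hat{y}=1}{\gtrless}} \;\; \gamma .
\]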


ROC Curves
•  Characterize performance as we vary the decision threshold
(e.g., the Bayes classifier’s multiplier alpha)

[Figure: ROC curve traced by the Bayes classifier as alpha varies.
 Vertical axis: true positive rate (sensitivity); horizontal axis: false positive rate
 (1 - specificity). “Guess all 1” sits at the top right, “guess all 0” at the bottom
 left, and “guess at random, proportion alpha” traces the diagonal.]
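A sketch in Python/numpy (not the course code) of tracing an ROC curve by sweeping the threshold over per-example scores such as p(y=1 | x):

import numpy as np

def roc_curve(y, score):
    """Sweep a threshold over the scores and record (false positive rate, true positive rate)."""
    y, score = np.asarray(y), np.asarray(score)
    fpr, tpr = [], []
    for t in np.unique(score)[::-1]:              # from the highest score downward
        yhat = (score >= t).astype(int)
        tpr.append(np.mean(yhat[y == 1] == 1))    # sensitivity
        fpr.append(np.mean(yhat[y == 0] == 1))    # 1 - specificity
    return np.array(fpr), np.array(tpr)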
Probabilistic vs. Discriminative learning
•  “Discriminative” learning: output a prediction ŷ(x)
•  “Probabilistic” learning: output a probability p(y|x)
(expresses confidence in outcomes)
•  “Probabilistic” learning
–  Conditional models just explain y: p(y|x)
–  Generative models also explain x: p(x,y)
•  Often a component of unsupervised or semi-supervised learning
–  Bayes and Naïve Bayes classifiers are generative models


Probabilistic vs. Discriminative learning
•  “Discriminative” learning: output a prediction ŷ(x)
•  “Probabilistic” learning: output a probability p(y|x)
(expresses confidence in outcomes)
•  Can use ROC curves for discriminative models also:
–  Some notion of confidence, but it doesn’t correspond to a probability
–  In our code: “predictSoft” (vs. hard prediction, “predict”)

>> learner = gaussianBayesClassify(X,Y);  % build a classifier
>> Ysoft = predictSoft(learner, X);       % N x C matrix of confidences
>> plotSoftClassify2D(learner,X,Y);       % shaded confidence plot
ROC Curves
•  Characterize performance as we vary our confidence threshold
•  Reduce performance to one number?
–  AUC = “area under the ROC curve”
–  0.5 < AUC < 1

[Figure: ROC curves for two classifiers, A and B; the better classifier’s curve lies
 closer to the top-left corner. “Guess all 1”, “guess all 0”, and “guess at random,
 proportion alpha” mark the corners and the diagonal. Axes: true positive rate
 (sensitivity) vs. false positive rate (1 - specificity).]
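Given the (FPR, TPR) points from a curve such as the sketch earlier, the AUC is just the area under them, e.g. by the trapezoid rule (the numbers here are illustrative):

import numpy as np

def auc(fpr, tpr):
    """Area under an ROC curve, trapezoid rule, after sorting the points by FPR."""
    order = np.argsort(fpr)
    return np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order])

print(auc([0.0, 0.2, 0.6, 1.0], [0.0, 0.7, 0.9, 1.0]))   # -> 0.77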

Machine Learning and Data Mining

Gaussian Bayes Classifiers

Prof. Alexander Ihler


Gaussian models
•  “Bayes optimal” decision
–  Choose most likely class
•  Decision boundary
–  Places where probabilities equal

•  What shape is the boundary?


Gaussian models
•  Bayes optimal decision boundary
–  p(y=0 | x) = p(y=1 | x)
–  Transition point between p(y=0|x) >/< p(y=1|x)
•  Assume Gaussian models with equal covariances
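A sketch of why equal covariances give a linear boundary (the slides show this only graphically): comparing the two class scores on a log scale,

\[
\log \frac{p(x \mid y=1)\, p(y=1)}{p(x \mid y=0)\, p(y=0)}
= (\mu_1 - \mu_0)^{T} \Sigma^{-1} x
\;-\; \tfrac{1}{2}\big(\mu_1^{T}\Sigma^{-1}\mu_1 - \mu_0^{T}\Sigma^{-1}\mu_0\big)
\;+\; \log \frac{p(y=1)}{p(y=0)}
\;=\; a^{T}x + b,
\]

which is linear in x, so the set where it equals zero (the decision boundary) is a hyperplane.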
Gaussian example
•  Spherical covariance: Σ = σ² I
•  Decision rule

[Figure: two spherical Gaussian classes in 2-D separated by a linear decision boundary]
Non-spherical Gaussian distributions
•  Equal covariances => still a linear decision rule
–  May be “modulated” by the variance direction
–  Scales; rotates (if correlated)

Ex:  Variance = [ 3    0   ]
                [ 0    .25 ]

[Figure: elongated Gaussian contours for each class and the resulting linear decision
 boundary]
Class posterior probabilities
•  Useful to also know class probabilities
•  Some notation
–  p(y=0) , p(y=1) – class prior probabilities
•  How likely is each class in general?
–  p(x | y=c) – class conditional probabilities
•  How likely are observations “x” in that class?
–  p(y=c | x) – class posterior probability
•  How likely is class c given an observation x?
Class posterior probabilities
•  Useful to also know class probabilities
•  Some notation
–  p(y=0) , p(y=1) – class prior probabilities
•  How likely is each class in general?
–  p(x | y=c) – class conditional probabilities
•  How likely are observations “x” in that class?
–  p(y=c | x) – class posterior probability
•  How likely is class c given an observation x?

•  We can compute posterior using Bayes’ rule


–  p(y=c | x) = p(x|y=c) p(y=c) / p(x)
•  Compute p(x) using sum rule / law of total prob.
–  p(x) = p(x|y=0) p(y=0) + p(x|y=1)p(y=1)
Class posterior probabilities
•  Consider comparing two classes
–  p(x | y=0) * p(y=0) vs p(x | y=1) * p(y=1)
–  Write probability of each class as
–  p(y=0 | x) = p(y=0, x) / p(x)
–  = p(y=0, x) / ( p(y=0,x) + p(y=1,x) )
–  = 1 / (1 + exp( -a ) ) (**)

–  a = log [ p(x|y=0) p(y=0) / p(x|y=1) p(y=1) ]


–  (**) called the logistic function, or logistic sigmoid.
Gaussian models
•  Return to Gaussian models with equal covariances

Now we also know that the probability of each class is given by:
p(y=0 | x) = Logistic( aᵀx + b ), where Logistic is the sigmoid (**) from the previous slide

We’ll see this form again soon…  (A small numerical sketch follows below.)
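A small numerical sketch in Python/numpy of the logistic form of the posterior, assuming two 1-D Gaussian classes with a shared variance (the parameter values are illustrative, not from the slides):

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

mu0, mu1, var, p0, p1 = 0.0, 2.0, 1.0, 0.5, 0.5    # illustrative class means, shared variance, priors

def posterior_y0(x):
    """p(y=0 | x) = Logistic(a), with a = log[ p(x|y=0) p(y=0) / p(x|y=1) p(y=1) ];
    with equal variances, a is linear in x."""
    a = (mu0 - mu1) / var * x + (mu1**2 - mu0**2) / (2 * var) + np.log(p0 / p1)
    return logistic(a)

print(posterior_y0(np.array([-1.0, 1.0, 3.0])))   # ~0.98, 0.5, ~0.02: confident, uncertain, confident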
