Machine Learning and Data Mining: Prof. Alexander Ihler
Bayes Classifiers
• P(F|H) = ?
• Joint distribution
• Bayes Rule:  P(F|H) = P(H|F) P(F) / P(H)
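Below is a minimal sketch (with made-up joint probabilities for two binary events F and H) of computing the conditional both directly from the joint distribution and via Bayes rule:

    # Sketch: conditional from a joint distribution, and the same answer via Bayes rule.
    # The joint probabilities below are invented for illustration only.
    joint = {  # p(F, H) over binary F and H
        (0, 0): 0.50, (0, 1): 0.20,
        (1, 0): 0.05, (1, 1): 0.25,
    }

    def p_F_given_H(f, h):
        """p(F=f | H=h) = p(F=f, H=h) / p(H=h)."""
        p_h = sum(p for (fv, hv), p in joint.items() if hv == h)
        return joint[(f, h)] / p_h

    # Bayes rule route: p(F|H) = p(H|F) p(F) / p(H)
    p_F1 = sum(p for (fv, hv), p in joint.items() if fv == 1)
    p_H1_given_F1 = joint[(1, 1)] / p_F1
    p_H1 = sum(p for (fv, hv), p in joint.items() if hv == 1)
    print(p_F_given_H(1, 1), p_H1_given_F1 * p_F1 / p_H1)  # both ~0.556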
Multivariate Gaussian models
• Similar to univariate case
• Maximum likelihood estimate:
  μ̂ = (1/m) ∑_j x^(j)
  Σ̂ = (1/m) ∑_j (x^(j) - μ̂)(x^(j) - μ̂)ᵀ
[Figure: 2-D data scatter with contours of the fitted Gaussian]
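A short NumPy sketch of these estimates, using randomly generated data purely for illustration:

    import numpy as np

    # Sketch: maximum likelihood fit of a multivariate Gaussian to data X (m x d).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]]) + np.array([2.0, 1.0])

    mu_hat = X.mean(axis=0)                    # (1/m) sum_j x^(j)
    diff = X - mu_hat
    Sigma_hat = diff.T @ diff / X.shape[0]     # (1/m) sum_j (x^(j)-mu)(x^(j)-mu)^T
    # np.cov(X, rowvar=False, bias=True) gives the same MLE covariance.
    print(mu_hat, Sigma_hat)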
Example: Gaussian Bayes for Iris Data
• Fit Gaussian distribution to each class {0,1,2}
• Overfitting!
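A sketch of such a classifier, assuming scikit-learn and SciPy are available for the Iris loader and the Gaussian density (this is an illustrative fit, not the exact code behind the slides):

    import numpy as np
    from scipy.stats import multivariate_normal
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),              # class mean
                     np.cov(Xc, rowvar=False),     # full covariance per class
                     Xc.shape[0] / X.shape[0])     # class prior p(y=c)

    def predict(x):
        # choose argmax_c p(x | y=c) p(y=c)
        scores = {c: multivariate_normal.pdf(x, mean=mu, cov=S) * prior
                  for c, (mu, S, prior) in params.items()}
        return max(scores, key=scores.get)

    train_acc = np.mean([predict(x) == yi for x, yi in zip(X, y)])
    print("training accuracy:", train_acc)  # training accuracy alone can hide overfitting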
Overfitting and density estimation
• Estimate probabilities from the data
  – E.g., how many times (what fraction) did each outcome occur?

  A  B  C    p(A,B,C | y=1)
  0  0  0        4/10
  0  0  1        1/10
  0  1  0        0/10
  0  1  1        0/10
  1  0  0        1/10
  1  0  1        2/10
  1  1  0        1/10
  1  1  1        1/10

• M data << 2^N parameters?
• What about the zeros?  (see the sketch after these bullets)
• Independence:
• p(a,b) = p(a) p(b)
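A count-based sketch of the issue, using a 10-example sample reconstructed from the table above; the add-one (Laplace) smoothing and the factored estimate shown are standard remedies, not something specific to these slides:

    from collections import Counter

    # 10 examples matching the counts in the table above
    data = [(0,0,0)]*4 + [(0,0,1)] + [(1,0,0)] + [(1,0,1)]*2 + [(1,1,0)] + [(1,1,1)]
    m = len(data)

    counts = Counter(data)
    outcomes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    joint = {abc: counts[abc] / m for abc in outcomes}
    # Unseen outcomes such as (0,1,0) get probability 0: the "zeros" problem.

    # One common remedy, add-one (Laplace) smoothing:
    smoothed = {abc: (counts[abc] + 1) / (m + 8) for abc in outcomes}

    # Alternative: assume independence and estimate each marginal separately,
    # so only a few parameters are needed instead of 2^N - 1.
    pA = sum(a for a, b, c in data) / m
    pB = sum(b for a, b, c in data) / m
    pC = sum(c for a, b, c in data) / m
    factored = {(a, b, c): (pA if a else 1-pA) * (pB if b else 1-pB) * (pC if c else 1-pC)
                for (a, b, c) in outcomes}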
Example: Naïve Bayes
Observed Data:
x1 x2 y
1 1 0
1 0 0
1 0 1
0 0 0
0 1 1
1 1 0
0 0 1
1 0 1
• Naïve Bayes:
– p(y|x) = p(x|y) p(y) / p(x) ;  p(x|y) = ∏i p(xi|y)  (see the sketch below)
– Covariates are independent given “cause”
– For Gaussian features, the covariance is diagonal:
    Σ = [ σ²11    0
           0    σ²22 ]
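A sketch of the count-based fit and prediction for the 8-example table above (the helper names are just for illustration):

    import numpy as np

    # Observed data from the table: columns x1, x2, y
    data = np.array([[1,1,0],[1,0,0],[1,0,1],[0,0,0],[0,1,1],[1,1,0],[0,0,1],[1,0,1]])
    X, y = data[:, :2], data[:, 2]

    p_y = {c: np.mean(y == c) for c in (0, 1)}                 # class priors
    p_x_given_y = {c: X[y == c].mean(axis=0) for c in (0, 1)}  # p(x_i = 1 | y = c)

    def posterior(x):
        # p(y=c | x) is proportional to p(y=c) * prod_i p(x_i | y=c)
        score = {}
        for c in (0, 1):
            px = np.where(np.array(x) == 1, p_x_given_y[c], 1 - p_x_given_y[c])
            score[c] = p_y[c] * np.prod(px)
        Z = sum(score.values())
        return {c: s / Z for c, s in score.items()}

    print(posterior([1, 1]))  # e.g. p(y=0 | x1=1, x2=1) = (1/2 * 3/4 * 1/2) / normalizer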
– Observe any x: choose whichever class has the larger joint probability p(x, y=c) at that x
[Figure: joint densities p(x, y=0) and p(x, y=1) plotted over feature x1; each curve's shape is the class conditional p(x | y=c), its area is the prior p(y=c), and the decision boundary lies where the curves cross]
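A one-dimensional sketch of that picture, with invented class-conditional Gaussians and priors:

    import numpy as np
    from scipy.stats import norm

    p0, p1 = 0.6, 0.4                     # areas: p(y=0), p(y=1)
    px_y0 = norm(loc=0.0, scale=1.0)      # shape: p(x | y=0)
    px_y1 = norm(loc=2.5, scale=1.0)      # shape: p(x | y=1)

    def decide(x):
        # choose the class whose joint p(x, y=c) = p(x | y=c) p(y=c) is larger
        return 0 if px_y0.pdf(x) * p0 > px_y1.pdf(x) * p1 else 1

    xs = np.linspace(-4, 6, 2001)
    boundary = xs[np.argmax([decide(x) for x in xs])]  # first x where the decision flips to 1
    print("decision boundary near x =", boundary)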
A Bayes classifier: add a multiplier alpha
  Decide ŷ = 1 if p(x | y=1) p(y=1) > α p(x | y=0) p(y=0); otherwise decide ŷ = 0
• Not all errors are created equally…
• Risk associated with each outcome?
  – Type 1 errors: false positives
  – Type 2 errors: false negatives
• Classical testing:
– Choose gamma so that FPR is fixed (“p-value”)
– Given that y=0 is true, what’s the probability we decide y=1?
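A small sketch of fixing the false positive rate in advance, with synthetic scores and a hypothetical 5% target:

    import numpy as np

    rng = np.random.default_rng(0)
    scores_y0 = rng.normal(0.0, 1.0, size=10_000)  # classifier scores when y=0 is true

    target_fpr = 0.05
    gamma = np.quantile(scores_y0, 1.0 - target_fpr)  # only ~5% of y=0 scores exceed gamma

    print(gamma, np.mean(scores_y0 > gamma))  # empirical FPR is approximately 0.05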
[Figure: ROC curve, true positive rate (sensitivity) vs. false positive rate; sweeping the multiplier alpha traces the Bayes classifier's curve between "guess all 0" and "guess all 1", while guessing class 1 at random with proportion alpha lies on the diagonal]
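A sketch of tracing the ROC curve by sweeping the multiplier, using synthetic one-dimensional scores for the two classes:

    import numpy as np

    rng = np.random.default_rng(1)
    s0 = rng.normal(0.0, 1.0, 5000)  # log-likelihood-ratio-like scores when y=0
    s1 = rng.normal(1.5, 1.0, 5000)  # scores when y=1

    thresholds = np.linspace(-4, 6, 200)          # each value plays the role of log(alpha)
    tpr = [(s1 > t).mean() for t in thresholds]   # true positive rate = sensitivity
    fpr = [(s0 > t).mean() for t in thresholds]   # false positive rate
    # (fpr, tpr) pairs trace the curve from "guess all 1" (alpha -> 0)
    # to "guess all 0" (alpha -> infinity).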
Probabilistic vs. Discriminative learning
[Figure: ROC curves comparing classifiers (e.g. "Classifier B" vs. "guess all 1"); vertical axis is true positive rate (sensitivity)]
Non-spherical Gaussian distributions
• Equal covariances => still linear decision rule
– May be “modulated” by variance direction
– Scales; rotates (if correlated)
Ex:  Variance  [ 3    0
                 0  .25 ]
[Figure: contours of two equal-covariance Gaussians and the resulting linear decision boundary]
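A sketch of the resulting linear rule, reusing the variance from the example above with invented means and priors:

    import numpy as np

    # With equal covariances the Bayes decision rule is linear: sign of a^T x + b.
    Sigma = np.array([[3.0, 0.0], [0.0, 0.25]])
    mu0, mu1 = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
    p0, p1 = 0.5, 0.5

    Sinv = np.linalg.inv(Sigma)
    a = Sinv @ (mu0 - mu1)   # boundary normal, scaled/rotated by Sigma^-1
    b = -0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1) + np.log(p0 / p1)

    x = np.array([0.5, -1.0])
    print("decide y=0" if a @ x + b > 0 else "decide y=1")  # sign of a^T x + b picks the class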
Class posterior probabilities
• Useful to also know class probabilities
• Some notation
– p(y=0) , p(y=1) – class prior probabilities
• How likely is each class in general?
– p(x | y=c) – class conditional probabilities
• How likely are observations “x” in that class?
– p(y=c | x) – class posterior probability
• How likely is class c given an observation x?
(**)
Now we also know that the probability of each class is given by:
p(y=0 | x) = Logistic( ** ) = Logistic( aᵀ x + b )
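A sketch of the step behind (**), assuming equal class covariances Σ and Logistic(z) = 1/(1+e^{-z}):

    % Log-odds of the two classes under equal-covariance Gaussians:
    \[
    \begin{aligned}
    \ln\frac{p(y=0\mid x)}{p(y=1\mid x)}
      &= \ln\frac{p(x\mid y=0)\,p(y=0)}{p(x\mid y=1)\,p(y=1)} \\
      &= -\tfrac12(x-\mu_0)^\top\Sigma^{-1}(x-\mu_0)
         +\tfrac12(x-\mu_1)^\top\Sigma^{-1}(x-\mu_1)
         +\ln\tfrac{p(y=0)}{p(y=1)} \\
      &= \underbrace{(\mu_0-\mu_1)^\top\Sigma^{-1}}_{a^\top}\,x
         \;\underbrace{-\tfrac12\!\left(\mu_0^\top\Sigma^{-1}\mu_0-\mu_1^\top\Sigma^{-1}\mu_1\right)
         +\ln\tfrac{p(y=0)}{p(y=1)}}_{b}
    \end{aligned}
    \]

Since any probability p satisfies p = Logistic( ln(p / (1-p)) ), this gives p(y=0 | x) = Logistic(aᵀ x + b) with a = Σ⁻¹(μ0 - μ1).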