Tutorial 4

This document discusses Naive Bayes and Gaussian Bayes classifiers. It begins by introducing Bayes' rule and the Naive Bayes assumption of conditional independence between features given the class. It then provides an example of using a Bernoulli Naive Bayes model for spam classification. The document derives the maximum likelihood estimators for the Naive Bayes model parameters. It also discusses using a Gaussian distribution instead of conditional independence, and derives the maximum likelihood estimators for the Gaussian Bayes model parameters.

Naive Bayes and Gaussian Bayes Classifier

Mengye Ren
[email protected]

October 18, 2015

Naive Bayes

Bayes' rule:

$$p(t|x) = \frac{p(x|t)\,p(t)}{p(x)}$$

Naive Bayes assumption:

$$p(x|t) = \prod_{j=1}^{D} p(x_j|t)$$

Likelihood function:

$$L(\theta) = p(x, t|\theta) = p(x|t, \theta)\,p(t|\theta)$$
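To make Bayes' rule concrete, here is a small worked example with made-up numbers (not from the tutorial): suppose $p(\text{spam}) = 0.4$, $p(x|\text{spam}) = 0.3$, and $p(x|\neg\text{spam}) = 0.05$. Then

$$p(\text{spam}|x) = \frac{0.3 \times 0.4}{0.3 \times 0.4 + 0.05 \times 0.6} = \frac{0.12}{0.15} = 0.8.$$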

Example: Spam Classification

Each vocabulary word is one feature dimension.

We encode each email as a feature vector $x \in \{0,1\}^{|V|}$, where
$x_j = 1$ iff vocabulary word $j$ appears in the email.

We want to model the probability of any word $x_j$ appearing in an
email, given that the email is spam or not.
Example words: $10,000, Toronto, Piazza, etc.

Idea: use a Bernoulli distribution to model $p(x_j|t)$.
Example: $p(\text{``\$10,000''}\,|\,\text{spam}) = 0.3$
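As a concrete sketch of this encoding (the vocabulary and email below are made-up examples, not from the tutorial):

```python
import numpy as np

vocab = ["$10,000", "toronto", "piazza", "free", "winner"]
word_index = {w: j for j, w in enumerate(vocab)}

def encode(email_words):
    """Return x in {0,1}^|V| with x_j = 1 iff vocabulary word j appears."""
    x = np.zeros(len(vocab), dtype=int)
    for w in email_words:
        if w in word_index:
            x[word_index[w]] = 1
    return x

print(encode(["you", "are", "a", "winner", "free", "$10,000"]))  # [1 0 0 1 1]
```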

Bernoulli Naive Bayes

Assuming all data points $x^{(i)}$ are i.i.d. samples, and $p(x_j|t)$ follows a
Bernoulli distribution with parameter $\mu_{jt}$:

$$p(x^{(i)}|t^{(i)}) = \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$

$$p(t|x) \propto \prod_{i=1}^{N} p(t^{(i)})\,p(x^{(i)}|t^{(i)}) = \prod_{i=1}^{N} p(t^{(i)}) \prod_{j=1}^{D} \mu_{jt^{(i)}}^{x_j^{(i)}} \left(1 - \mu_{jt^{(i)}}\right)^{1 - x_j^{(i)}}$$

where $p(t) = \pi_t$. The parameters $\pi_t$ and $\mu_{jt}$ can be learnt using maximum
likelihood.
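For instance, the per-example log of this joint could be computed as follows (a minimal sketch; `mu[k, j]` and `pi[k]` are assumed given, and the names are ours):

```python
import numpy as np

def log_joint(x, k, mu, pi):
    """log pi_k + sum_j [x_j log mu_kj + (1 - x_j) log(1 - mu_kj)]."""
    return np.log(pi[k]) + np.sum(
        x * np.log(mu[k]) + (1 - x) * np.log(1 - mu[k]))
```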

Derivation of maximum likelihood estimator (MLE)

$$\theta = [\mu, \pi]$$

$$\log L(\theta) = \log p(x, t|\theta) = \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} + \sum_{j=1}^{D} x_j^{(i)} \log \mu_{jt^{(i)}} + \left(1 - x_j^{(i)}\right) \log\left(1 - \mu_{jt^{(i)}}\right) \right]$$

Want: $\arg\max_\theta \log L(\theta)$ subject to $\sum_k \pi_k = 1$.

Derivation of maximum likelihood estimator (MLE)
Take the derivative w.r.t. $\mu$:

$$\frac{\partial \log L(\theta)}{\partial \mu_{jk}} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left( \frac{x_j^{(i)}}{\mu_{jk}} - \frac{1 - x_j^{(i)}}{1 - \mu_{jk}} \right) = 0$$

$$\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ x_j^{(i)} \left(1 - \mu_{jk}\right) - \left(1 - x_j^{(i)}\right) \mu_{jk} \right] = 0$$

$$\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \mu_{jk} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x_j^{(i)}$$

$$\mu_{jk} = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x_j^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$

Derivation of maximum likelihood estimator (MLE)

Use a Lagrange multiplier to derive $\pi$:

$$\frac{\partial L(\theta)}{\partial \pi_k} + \lambda \frac{\partial \sum_\kappa \pi_\kappa}{\partial \pi_k} = 0 \;\Rightarrow\; \lambda = -\frac{1}{\pi_k} \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]$$

$$\pi_k = -\frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{\lambda}$$

Apply the constraint $\sum_k \pi_k = 1 \Rightarrow \lambda = -N$:

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{N}$$
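In code, both closed-form estimators are just counts; a minimal sketch for binary data (variable names are ours, not from the tutorial):

```python
import numpy as np

def fit_bernoulli_nb(X, t, K):
    """MLE for pi_k and mu_jk; X is (N, D) binary, t is (N,) ints in {0..K-1}."""
    N, D = X.shape
    one_hot = np.eye(K)[t]                  # one_hot[i, k] = 1[t_i = k]
    counts = one_hot.sum(axis=0)            # sum_i 1[t_i = k]
    pi = counts / N                         # pi_k = count_k / N
    mu = one_hot.T @ X / counts[:, None]    # mu_kj = sum_i 1[t_i = k] x_ij / count_k
    return pi, mu
```

(In practice a small smoothing count is often added so that no $\mu_{jk}$ is exactly 0 or 1.)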

Spam Classification Demo
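The slides defer to a live demo here; the original demo code is not included in this document. A toy stand-in using the estimators above might look like this (data and names are made up):

```python
import numpy as np

X = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]])  # 4 toy "emails"
t = np.array([1, 1, 0, 0])                                   # 1 = spam

one_hot = np.eye(2)[t]
pi = one_hot.sum(axis=0) / len(t)                  # class priors
mu = one_hot.T @ X / one_hot.sum(axis=0)[:, None]  # per-class word probabilities

eps = 1e-9                                         # guard against log(0)
log_post = (np.log(pi)
            + X @ np.log(mu + eps).T
            + (1 - X) @ np.log(1 - mu + eps).T)    # unnormalized log-posterior
print(log_post.argmax(axis=1))                     # [1 1 0 0] on this toy data
```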

Gaussian Bayes Classifier

Instead of assuming conditional independence of the $x_j$, we model $p(x|t)$ as a
Gaussian distribution, and the dependence between the $x_j$ is encoded in the
covariance matrix.

Multivariate Gaussian distribution:

$$f(x) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

$\mu$: mean, $\Sigma$: covariance matrix, $D$: $\dim(x)$
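A direct transcription of this density (a sketch; the function and variable names are ours):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density at x, computed exactly as above."""
    D = x.shape[0]
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm
```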

Derivation of maximum likelihood estimator (MLE)

$$\theta = [\mu, \Sigma, \pi], \qquad Z = \sqrt{(2\pi)^D \det(\Sigma)}$$

$$p(x|t) = \frac{1}{Z} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

$$\log L(\theta) = \log p(x, t|\theta) = \log p(t|\theta) + \log p(x|t, \theta)$$

$$= \sum_{i=1}^{N} \left[ \log \pi_{t^{(i)}} - \log Z - \frac{1}{2} \left(x^{(i)} - \mu_{t^{(i)}}\right)^T \Sigma_{t^{(i)}}^{-1} \left(x^{(i)} - \mu_{t^{(i)}}\right) \right]$$

Want: $\arg\max_\theta \log L(\theta)$ subject to $\sum_k \pi_k = 1$.

Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. $\mu$:

$$\frac{\partial \log L}{\partial \mu_k} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \Sigma_k^{-1} \left(x^{(i)} - \mu_k\right) = 0$$

$$\mu_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] x^{(i)}}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$

Derivation of maximum likelihood estimator (MLE)

Take the derivative w.r.t. $\Sigma^{-1}$ (not $\Sigma$).

Note:

$$\frac{\partial \det(A)}{\partial A} = \det(A) \left(A^{-1}\right)^T$$

$$\det(A)^{-1} = \det\left(A^{-1}\right)$$

$$\frac{\partial x^T A x}{\partial A} = x x^T$$

$$\Sigma^T = \Sigma$$

$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = -\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ \frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} + \frac{1}{2} \left(x^{(i)} - \mu_k\right) \left(x^{(i)} - \mu_k\right)^T \right] = 0$$

Derivation of maximum likelihood estimator (MLE)
$$Z_k = \sqrt{(2\pi)^D \det(\Sigma_k)}$$

$$\frac{\partial \log Z_k}{\partial \Sigma_k^{-1}} = \frac{1}{Z_k} \frac{\partial Z_k}{\partial \Sigma_k^{-1}} = (2\pi)^{-\frac{D}{2}} \det(\Sigma_k)^{-\frac{1}{2}}\, (2\pi)^{\frac{D}{2}} \frac{\partial \det\left(\Sigma_k^{-1}\right)^{-\frac{1}{2}}}{\partial \Sigma_k^{-1}}$$

$$= \det\left(\Sigma_k^{-1}\right)^{\frac{1}{2}} \left( -\frac{1}{2} \right) \det\left(\Sigma_k^{-1}\right)^{-\frac{3}{2}} \det\left(\Sigma_k^{-1}\right) \Sigma_k^T = -\frac{1}{2} \Sigma_k$$

$$\frac{\partial \log L}{\partial \Sigma_k^{-1}} = \sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left[ \frac{1}{2} \Sigma_k - \frac{1}{2} \left(x^{(i)} - \mu_k\right) \left(x^{(i)} - \mu_k\right)^T \right] = 0$$

$$\Sigma_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right] \left(x^{(i)} - \mu_k\right) \left(x^{(i)} - \mu_k\right)^T}{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}$$

Derivation of maximum likelihood estimator (MLE)

$$\pi_k = \frac{\sum_{i=1}^{N} \mathbb{1}\left[t^{(i)} = k\right]}{N}$$

(Same as the Bernoulli case.)
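Putting the three estimators together in code (a minimal sketch; real-valued `X` of shape (N, D), integer labels `t`, and the names are ours):

```python
import numpy as np

def fit_gaussian_bayes(X, t, K):
    """MLE for pi_k, mu_k, Sigma_k as derived above."""
    N, D = X.shape
    pi = np.bincount(t, minlength=K) / N
    mu = np.stack([X[t == k].mean(axis=0) for k in range(K)])
    Sigma = np.stack([
        (X[t == k] - mu[k]).T @ (X[t == k] - mu[k]) / (t == k).sum()
        for k in range(K)])
    return pi, mu, Sigma
```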

Gaussian Bayes Classifier Demo
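As with the spam demo, the original code is not part of this document; a synthetic stand-in (made-up data, our names) might be:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 0.5, size=(50, 2))])
t = np.repeat([0, 1], 50)

# Inline MLE fit (np.cov with bias=True divides by N, matching the MLE).
pi = np.bincount(t) / len(t)
mu = np.stack([X[t == k].mean(axis=0) for k in range(2)])
Sigma = np.stack([np.cov(X[t == k].T, bias=True) for k in range(2)])

def log_posterior(x):
    """Unnormalized log p(t = k | x) for each class k."""
    scores = []
    for k in range(2):
        diff = x - mu[k]
        scores.append(np.log(pi[k])
                      - 0.5 * np.log((2 * np.pi) ** 2 * np.linalg.det(Sigma[k]))
                      - 0.5 * diff @ np.linalg.solve(Sigma[k], diff))
    return np.array(scores)

print(log_posterior(np.array([2.0, 2.0])).argmax())  # classify one point
```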

Gaussian Bayes Classifier

If we constrain $\Sigma$ to be diagonal, then we can rewrite $p(x|t)$ as a product
of the $p(x_j|t)$:

$$p(x|t) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma_t)}} \exp\left( -\frac{1}{2} (x - \mu_t)^T \Sigma_t^{-1} (x - \mu_t) \right)$$

$$= \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi \Sigma_{t,jj}}} \exp\left( -\frac{(x_j - \mu_{jt})^2}{2\Sigma_{t,jj}} \right) = \prod_{j=1}^{D} p(x_j|t)$$

A diagonal covariance matrix satisfies the naive Bayes assumption.
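A quick numeric check of this factorization, with toy numbers of our choosing:

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])
mu = np.array([0.0, 1.0, 1.5])
var = np.array([1.0, 0.5, 2.0])    # diagonal entries of Sigma_t

joint = np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / np.sqrt(
    (2 * np.pi) ** 3 * np.prod(var))
per_dim = np.prod(np.exp(-0.5 * (x - mu) ** 2 / var)
                  / np.sqrt(2 * np.pi * var))
print(np.isclose(joint, per_dim))  # True
```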

Gaussian Bayes Classifier

Case 1: the covariance matrix is shared among classes,

$$p(x|t) = \mathcal{N}(x|\mu_t, \Sigma)$$

Case 2: each class has its own covariance,

$$p(x|t) = \mathcal{N}(x|\mu_t, \Sigma_t)$$

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is shared between classes, the decision boundary is where

$$p(t = 1|x) = p(t = 0|x)$$

$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0)$$

$$C + x^T \Sigma^{-1} x - 2\mu_1^T \Sigma^{-1} x + \mu_1^T \Sigma^{-1} \mu_1 = x^T \Sigma^{-1} x - 2\mu_0^T \Sigma^{-1} x + \mu_0^T \Sigma^{-1} \mu_0$$

$$2(\mu_0 - \mu_1)^T \Sigma^{-1} x - \left( \mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1 \right) = C$$

$$\Rightarrow a^T x - b = 0$$

The decision boundary is a linear function (a hyperplane in general).
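In code, with the prior terms written out explicitly (a sketch; names are ours, and the sign convention is chosen so that $a^T x - b > 0$ selects class 1):

```python
import numpy as np

def linear_boundary(mu0, mu1, Sigma, pi0, pi1):
    """Coefficients of the boundary a @ x - b = 0 for shared covariance."""
    Sinv = np.linalg.inv(Sigma)
    a = 2 * Sinv @ (mu1 - mu0)
    b = (2 * np.log(pi0 / pi1)
         - mu0 @ Sinv @ mu0 + mu1 @ Sinv @ mu1)
    return a, b  # a @ x - b > 0  <=>  class 1 is more probable
```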

Relation to Logistic Regression

We can write the posterior distribution $p(t = 0|x)$ as

$$\frac{p(x, t = 0)}{p(x, t = 0) + p(x, t = 1)} = \frac{\pi_0 \mathcal{N}(x|\mu_0, \Sigma)}{\pi_0 \mathcal{N}(x|\mu_0, \Sigma) + \pi_1 \mathcal{N}(x|\mu_1, \Sigma)}$$

$$= \left[ 1 + \frac{\pi_1}{\pi_0} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) \right]^{-1}$$

$$= \left[ 1 + \exp\left( \log \frac{\pi_1}{\pi_0} + (\mu_1 - \mu_0)^T \Sigma^{-1} x - \frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) \right) \right]^{-1}$$

$$= \frac{1}{1 + \exp(-w^T x - b)}$$
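A quick numeric check of this identity on made-up parameters (our names; the shared Gaussian normalizer cancels, so it is omitted):

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
pi0, pi1 = 0.6, 0.4
Sinv = np.linalg.inv(Sigma)

w = -Sinv @ (mu1 - mu0)
b = -np.log(pi1 / pi0) + 0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0)

def gauss(x, mu):
    d = x - mu
    return np.exp(-0.5 * d @ Sinv @ d)   # unnormalized; normalizers cancel

x = np.array([1.5, -0.5])
direct = pi0 * gauss(x, mu0) / (pi0 * gauss(x, mu0) + pi1 * gauss(x, mu1))
sigmoid = 1 / (1 + np.exp(-(w @ x + b)))
print(np.isclose(direct, sigmoid))       # True
```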

Gaussian Bayes Binary Classifier Decision Boundary

If the covariance is not shared between classes, the decision boundary is where

$$p(t = 1|x) = p(t = 0|x)$$

$$\log \pi_1 - \frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) = \log \pi_0 - \frac{1}{2} (x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0)$$

$$x^T \left( \Sigma_1^{-1} - \Sigma_0^{-1} \right) x - 2 \left( \mu_1^T \Sigma_1^{-1} - \mu_0^T \Sigma_0^{-1} \right) x + \left( \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_0^T \Sigma_0^{-1} \mu_0 \right) = C$$

$$\Rightarrow x^T Q x - 2b^T x + c = 0$$

The decision boundary is a quadratic function. In the 2-d case, it looks
like an ellipse, a parabola, or a hyperbola.
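In code, with the priors and normalizers folded into the constant term (a sketch; names are ours):

```python
import numpy as np

def quadratic_boundary(mu0, mu1, Sigma0, Sigma1, pi0, pi1):
    """Coefficients of x @ Q @ x - 2 * b @ x + c = 0 for unshared covariances."""
    S0, S1 = np.linalg.inv(Sigma0), np.linalg.inv(Sigma1)
    Q = S1 - S0
    b = S1 @ mu1 - S0 @ mu0
    c = (mu1 @ S1 @ mu1 - mu0 @ S0 @ mu0
         - 2 * np.log(pi1 / pi0)
         + np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma0)))
    return Q, b, c
```

Note the $\log\det$ terms: unlike the shared case, the Gaussian normalizers no longer cancel, so they appear in the constant.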

Thanks!
