
Naive Bayes

S. Sumitra
Department of Mathematics
Indian Institute of Space Science and Technology

MA613 Data Mining


Characteristics

Binary Classification
Attributes are discrete valued
Bayes Theorem: p(y | x) = p(y) p(x | y) / p(x)
Joint Probability distribution

P(AB) = P(A)P(B | A) = P(B)P(A | B)

P(AB | C) = P(A | C)P(B | AC) (1)

P(ABC) = P(A)P(BC | A)
       = P(A)P(B | A)P(C | AB)

P(X1, X2, . . . , Xn) = P(X1)P(X2 | X1)P(X3 | X1, X2) · · · P(Xn | X1, X2, . . . , Xn−1)
                      = ∏_{i=1}^{n} P(Xi | X1, X2, . . . , Xi−1)

Different orderings of the Xi's lead to different factorizations.
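
The factorization can be checked numerically. Below is a minimal Python sketch (not from the slides; the joint table is made up) that applies the chain rule P(X1, X2, X3) = P(X1)P(X2 | X1)P(X3 | X1, X2) to a small distribution over three binary variables and verifies that the product of conditionals recovers the joint.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (sums to 1).
joint = {(0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.20,
         (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10}

def marginal(fixed):
    """Sum the joint over all assignments consistent with `fixed` (index -> value)."""
    return sum(p for x, p in joint.items() if all(x[i] == v for i, v in fixed.items()))

for x1, x2, x3 in product([0, 1], repeat=3):
    p_x1 = marginal({0: x1})
    p_x2_given_x1 = marginal({0: x1, 1: x2}) / p_x1
    p_x3_given_x1x2 = joint[(x1, x2, x3)] / marginal({0: x1, 1: x2})
    # Chain rule: the product of the three factors equals the joint probability.
    assert abs(p_x1 * p_x2_given_x1 * p_x3_given_x1x2 - joint[(x1, x2, x3)]) < 1e-12
```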


Conditional Independence

A is said to be conditionally independent of B given C if

P(A | BC) = P(A | C) (2)

That is, learning the value of B does not change the prediction of
A once we know the value of C. The conditional independence
relation is symmetric. Therefore, if (2) is true, then

P(B | AC) = P(B | C) (3)

Hence (1) becomes,

P(AB | C) = P(A | C)P(B | C) (4)


Conditional Independence

Definition
Let X , Y , and Z be sets of random variables. X is conditionally
independent of Y given Z in a distribution P if
P(X = x, Y = y | Z = z) = P(X = x | Z = z) P(Y = y | Z = z)
for all x ∈ Range(X), y ∈ Range(Y), z ∈ Range(Z).

The notation for conditional independence is X ⊥⊥ Y | Z, which
means X is conditionally independent of Y given Z.
X and Y are said to be independent (marginally independent) if
P(XY) = P(X)P(Y). The notation for independence is
X ⊥⊥ Y or X ⊥⊥ Y | ∅.
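
A minimal numerical check of the definition (all numbers are hypothetical): build a joint P(X, Y, Z) that satisfies X ⊥⊥ Y | Z by construction and verify P(X, Y | Z) = P(X | Z) P(Y | Z).

```python
from itertools import product

p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}   # p_y_given_z[z][y]

# Joint built as P(z) P(x | z) P(y | z), so X and Y are conditionally independent given Z.
joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in product([0, 1], repeat=3)}

for z in [0, 1]:
    pz = sum(p for (x, y, zz), p in joint.items() if zz == z)
    for x, y in product([0, 1], repeat=2):
        p_xy_z = joint[(x, y, z)] / pz
        p_x_z = sum(joint[(x, yy, z)] for yy in [0, 1]) / pz
        p_y_z = sum(joint[(xx, y, z)] for xx in [0, 1]) / pz
        assert abs(p_xy_z - p_x_z * p_y_z) < 1e-12   # definition of X ⊥⊥ Y | Z
```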
Naive Bayes

Binary Classification
Let {(x1, y1), (x2, y2), . . . , (xN, yN)} be the given data, where
xi ∈ R^n and yi ∈ {1, 0}.
Discrete-valued attributes
Bayes Theorem
p(y = 1 | x) = p(y = 1) p(x | y = 1) / p(x)
Conditional Independence

Attributes are conditionally independent given y


xi = (xi1, xi2, . . . , xin)^T is a random variable
p(xi | y) = p(xi1 | y) p(xi2 | y) . . . p(xin | y)
p(yi = 1 | xi) = p(yi = 1) p(xi1 | y = 1) p(xi2 | y = 1) . . . p(xin | y = 1) / p(xi)
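
A minimal sketch of this posterior computation in Python, with made-up Bernoulli parameters p(xij = 1 | y) for four binary attributes; none of the numbers come from the slides' data.

```python
import numpy as np

prior = {1: 0.5, 0: 0.5}                         # p(y = 1), p(y = 0)
# theta[c][j] = p(x_j = 1 | y = c) for four binary attributes (hypothetical values).
theta = {1: np.array([0.6, 0.9, 0.2, 0.8]),
         0: np.array([0.5, 0.4, 0.7, 0.3])}

x = np.array([1, 1, 0, 1])                       # query point

def joint_score(c):
    # p(y = c) * prod_j p(x_j | y = c) with Bernoulli per-attribute likelihoods
    likelihood = np.prod(theta[c] ** x * (1 - theta[c]) ** (1 - x))
    return prior[c] * likelihood

scores = {c: joint_score(c) for c in (0, 1)}
posterior_1 = scores[1] / (scores[0] + scores[1])   # dividing by p(x) normalizes
print(posterior_1)
```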
Email Spam Classification

Email 1: Are you here? (spam)
Email 2: You are there (ham)
Email 3: you here (spam)
Email 4: Here there (ham)
Dictionary: {are, here, there, you}

Data   are  here  there  you  yi: 1/0
x1^T    1    1     0      1    1
x2^T    1    0     1      1    0
x3^T    0    1     0      1    1
x4^T    0    1     1      0    0
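
A short sketch that reproduces the table above from the raw emails, using the dictionary {are, here, there, you}; the tokenization (lowercasing and stripping punctuation) is an assumption of this sketch.

```python
dictionary = ["are", "here", "there", "you"]
emails = [("Are you here?", 1), ("You are there", 0), ("you here", 1), ("Here there", 0)]

def to_binary_vector(text):
    # 1 if the dictionary word occurs in the email, 0 otherwise
    words = {w.strip("?.,!").lower() for w in text.split()}
    return [1 if term in words else 0 for term in dictionary]

X = [to_binary_vector(text) for text, _ in emails]
y = [label for _, label in emails]
print(X)   # [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]]
print(y)   # [1, 0, 1, 0]
```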
Example

x = (W = F , X = T , Y = F )T
Categorical Attributes & Multiclass

Data   A1  A2  A3  A4  yi ∈ {0, 1, 2}
x1^T    1   1   0   1   1
x2^T    2   0   1   1   2
x3^T    1   1   0   1   1
x4^T    3   1   1   0   0
x5^T    3   2   1   0   0
x6^T    1   2   1   0   2

x^T = (3, 0, 1, 1)
Categorical Attributes

The generalization to the case where Ai can take values in {1, 2, . . . , k} is
straightforward.
x = (x^1, x^2, . . . , x^n)^T

p(x^j = α | y = c) = (# samples with label c and Aj = α) / (# samples with label c)
                   = Σ_i 1(xij = α ∩ yi = c) / Σ_i 1(yi = c)
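
A minimal sketch of this counting estimate on the multiclass table from the previous slide (no smoothing yet); column index 0 corresponds to A1.

```python
import numpy as np

X = np.array([[1, 1, 0, 1],
              [2, 0, 1, 1],
              [1, 1, 0, 1],
              [3, 1, 1, 0],
              [3, 2, 1, 0],
              [1, 2, 1, 0]])
y = np.array([1, 2, 1, 0, 0, 2])

def p_attr_given_class(j, alpha, c):
    in_class = (y == c)
    # (# samples with label c and A_j = alpha) / (# samples with label c)
    return np.sum((X[:, j] == alpha) & in_class) / np.sum(in_class)

print(p_attr_given_class(0, 3, 0))   # A1 = 3 among class 0: 2 / 2 = 1.0
print(p_attr_given_class(0, 3, 1))   # A1 = 3 among class 1: 0 / 2 = 0.0 (motivates smoothing)
```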
Continuous Attributes

Apply discretization to continuous attributes: map each attribute to a
small set of discrete values and then apply Naive Bayes as above.
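
A minimal sketch of one possible discretization, using numpy.digitize with arbitrary bin edges; the values and edges are made up.

```python
import numpy as np

values = np.array([0.2, 1.7, 3.4, 2.9, 0.8, 4.6])
bin_edges = np.array([1.0, 2.0, 3.0, 4.0])   # 5 bins: (-inf,1), [1,2), [2,3), [3,4), [4,inf)
discrete = np.digitize(values, bin_edges)    # bin index 0..4 for each value
print(discrete)                              # [0 1 3 2 0 4]
```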
MAP Decision Rule

The maximum a posteriori (MAP) criterion is used to determine
the class to which the given data point belongs.
The MAP decision rule is as follows:
the given data x^T = (x^(1), x^(2), . . . , x^(n)) is assigned to class ŷ = l,
where l ∈ {1, 2, . . . , K}, if

ŷ = argmax_{c ∈ {1,...,K}} p(y = c) ∏_{j=1}^{n} p(x^(j) | y = c)
  = argmax_{c ∈ {1,...,K}} πc ∏_{j=1}^{n} p(x^(j) | y = c),  where πc = p(y = c)
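
A minimal sketch of the MAP rule in log space (hypothetical priors and likelihood tables for four binary attributes); taking logs does not change the argmax and avoids underflow when n is large.

```python
import numpy as np

classes = [0, 1]
log_prior = {0: np.log(0.5), 1: np.log(0.5)}
# log_lik[c][j][alpha] = log p(x^(j) = alpha | y = c), rows are attributes (made-up values)
log_lik = {0: np.log([[0.5, 0.5], [0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]),
           1: np.log([[0.4, 0.6], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]])}

def predict(x):
    # score(c) = log pi_c + sum_j log p(x^(j) | y = c); return the maximizing class
    scores = {c: log_prior[c] + sum(log_lik[c][j][x[j]] for j in range(len(x)))
              for c in classes}
    return max(scores, key=scores.get)

print(predict([1, 1, 0, 1]))
```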
Laplace Smoothing

x = (x^(1), x^(2), . . . , x^(n))^T

Bernoulli:

p(x^(j) = α | y = c) = ((# samples with label c and Aj = α) + 1) / ((# samples with label c) + 2)
                     = (Σ_i 1(xij = α ∩ yi = c) + 1) / (Σ_i 1(yi = c) + 2)

Categorical distribution with l different values:

p(x^(j) = α | y = c) = (Σ_i 1(xij = α ∩ yi = c) + 1) / (Σ_i 1(yi = c) + l)
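
A minimal sketch of the categorical (l-value) case on the multiclass table from the earlier slide, where A1 takes l = 3 values; the point is that a zero count no longer forces a zero probability.

```python
import numpy as np

X = np.array([[1, 1, 0, 1], [2, 0, 1, 1], [1, 1, 0, 1],
              [3, 1, 1, 0], [3, 2, 1, 0], [1, 2, 1, 0]])
y = np.array([1, 2, 1, 0, 0, 2])

def laplace_estimate(j, alpha, c, l):
    # (count of A_j = alpha within class c + 1) / (class-c sample count + l)
    count = np.sum((X[:, j] == alpha) & (y == c))
    return (count + 1) / (np.sum(y == c) + l)

# Without smoothing p(A1 = 3 | y = 1) would be 0; with smoothing it is (0 + 1) / (2 + 3).
print(laplace_estimate(0, 3, 1, l=3))   # 0.2
```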
Example

Data   are  here  there  you  yi: 1/0
x1^T    1    1     0      0    1
x2^T    1    0     1      0    0
x3^T    0    1     0      0    1
x4^T    0    1     1      0    0

Query: "Here you are"

x^T = (1, 1, 0, 1)
The number of samples with A1 = 1 and y = 1 is 1, and the number of samples with y = 1 is 2.
Laplace smoothing adds two pseudo-samples to the y = 1 class, one with A1 = 1 and one with A1 = 0.
The count for A1 = 1 and y = 1 then becomes 2, out of 4 class-1 results.
p(A1 = 1 | y = 1) = (1 + 1) / (2 + 2) = 1/2
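
A short sketch reproducing this worked value with the Bernoulli (l = 2) formula on the table above.

```python
import numpy as np

X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 1, 1, 0]])
y = np.array([1, 0, 1, 0])

count = np.sum((X[:, 0] == 1) & (y == 1))   # A1 = 1 and y = 1 occurs once
p = (count + 1) / (np.sum(y == 1) + 2)      # (1 + 1) / (2 + 2)
print(p)                                    # 0.5
```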
Bernoulli, Binomial, Categorical and Multinomial Distributions
Bernoulli
Two outcomes, 1 trial
Coin tossing: H, T, H
x1 = 1, x2 = 0, x3 = 1
Binomial
Two outcomes, m > 1 trials
Coin tossed 2 times: HH, TH, HT
Aj: the number of times the jth outcome appears
x1 = (2, 0)^T, x2 = (1, 1)^T, x3 = (1, 1)^T
Categorical Distribution
n > 2 outcomes, 1 trial
Die tossing: 1, 2, 5
x1 = 1, x2 = 2, x3 = 5
Multinomial Distribution
n > 2 outcomes, m > 1 trials
Aj: the number of times the jth outcome appears
Die tossing: x1 = (12, 23, 10, 5, 6, 8)^T, x2 = (0, 1, 0, 1, 0, 0)^T
Multinomial Distribution

m trials
n outcomes
pi is the probability associated with the ith outcome, where i = 1, 2, . . . , n.
Consider the data x^T = (x^1, x^2, . . . , x^n), where x^j is the number
of times the jth outcome appears in m trials, j = 1, 2, . . . , n.

p(x^1, x^2, . . . , x^n) = ( m! / (x^1! · · · x^n!) ) p1^{x^1} · · · pn^{x^n}   when Σ_{j=1}^{n} x^j = m,
                         = 0   otherwise
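
A minimal sketch of this pmf in Python (the die counts and fair-die probabilities are made up); the constraint Σ_j x^j = m holds automatically here because m is taken as sum(x).

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """p(x^1, ..., x^n): probability of the counts x in m = sum(x) trials."""
    m = sum(x)
    coeff = factorial(m) / prod(factorial(xj) for xj in x)   # m! / (x^1! ... x^n!)
    return coeff * prod(pj ** xj for pj, xj in zip(p, x))

# Fair die, m = 4 trials, counts over the six faces (hypothetical observation).
print(multinomial_pmf([2, 1, 0, 0, 1, 0], [1/6] * 6))   # 12 * (1/6)**4 ≈ 0.00926
```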

Naive Bayes

Two types:
Categorical features
Aj ∈ {1, 2, . . . , l}
p(x^j = α | y = c) = [θαc]_j
[θαc]_j = (Σ_i 1(xij = α ∩ yi = c) + 1) / (Σ_i 1(yi = c) + l)
ŷ = argmax_c πc ∏_j [θαc]_j
Multinomial features
Multinomial Naive Bayes

x^T = (x^(1), . . . , x^(n)), where x^(j) is the number of times the jth event is
observed in a particular instance.
x^(j) ∈ {0, 1, . . . , m}, Σ_{j=1}^{n} x^(j) = m
x | y = c follows the multinomial distribution with
parameters (p1c, p2c, . . . , pnc) and m. Here, pjc is the
probability of the jth event occurring in class c.

p(x | y = c) = ( m! / (x^(1)! · · · x^(n)!) ) ∏_{j=1}^{n} pjc^{x^(j)}
Email Spam Classification: Multinomial event model

Data    are  here  there  you  yi: 1/0
x1^T     20   10     0     10   1
x2^T     30   90     8      7   0
x3^T    120    0   145     89   1
x4^T      1    0    93      0   0
Email Spam Classification: Multinomial event model: Case I

Data {(xi, yi), i = 1, 2, . . . , N}, xi^T = (xi1, xi2, . . . , xi|V|), where xij is
the number of times the jth dictionary word appears in the ith email.
Here, |V| is the number of words in the dictionary.
For example: x = (300, 50, 1)^T
x | c ∼ multinomial distribution whose number of outcomes
equals the number of words in the dictionary.
pjc is the probability of finding the jth dictionary word in class c.
For j = 1, 2, . . . , |V|,

pjc = (Σ_{i=1}^{N} xij 1(yi = c) + 1) / (Σ_{i=1}^{N} ni 1(yi = c) + |V|),  where ni = Σ_{j=1}^{|V|} xij

ŷ = argmax_c πc ∏_j (pjc)^{x^(j)}
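
A minimal sketch of Case I on the count table above: estimate pjc with add-one smoothing and score a hypothetical new count vector; the multinomial coefficient is omitted since it is the same for every class and cancels in the argmax.

```python
import numpy as np

X = np.array([[20, 10, 0, 10],
              [30, 90, 8, 7],
              [120, 0, 145, 89],
              [1, 0, 93, 0]])              # rows: emails, columns: dictionary words
y = np.array([1, 0, 1, 0])
V = X.shape[1]

def estimate_p(c):
    counts = X[y == c].sum(axis=0)               # sum_i x_ij 1(y_i = c), per word j
    return (counts + 1) / (counts.sum() + V)     # denominator: sum_i n_i 1(y_i = c) + |V|

p1, p0 = estimate_p(1), estimate_p(0)
pi1, pi0 = np.mean(y == 1), np.mean(y == 0)

x_new = np.array([3, 1, 0, 2])                   # hypothetical new email's word counts
score = lambda pi_c, p_c: np.log(pi_c) + np.sum(x_new * np.log(p_c))
print(1 if score(pi1, p1) > score(pi0, p0) else 0)
```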
Email Spam Classification: Multinomial event model: Case II

Data {(xi, yi), i = 1, 2, . . . , N}, xi = (xi1, xi2, . . . , xi,ni), where xij
is a word in the dictionary and ni is the number of words in the ith email.
For example, x: "There is a"
x | c ∼ multinomial distribution whose number of outcomes
equals the number of words in the dictionary.

pαc = (Σ_{i=1}^{N} Σ_{j=1}^{ni} 1(xij = α ∩ yi = c) + 1) / (Σ_{i=1}^{N} ni 1(yi = c) + |V|)

where α is a word in the dictionary.
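
A minimal sketch of Case II on the four example emails from earlier, written as word sequences (the lowercase tokenization is an assumption): the smoothed estimate pαc is the relative frequency of word α among all word positions in class-c emails.

```python
dictionary = ["are", "here", "there", "you"]
emails = [(["are", "you", "here"], 1), (["you", "are", "there"], 0),
          (["you", "here"], 1), (["here", "there"], 0)]

def p_word_given_class(alpha, c):
    # (occurrences of alpha in class-c emails + 1) / (total word positions in class c + |V|)
    num = sum(words.count(alpha) for words, label in emails if label == c) + 1
    den = sum(len(words) for words, label in emails if label == c) + len(dictionary)
    return num / den

print(p_word_given_class("here", 1))   # (2 + 1) / (5 + 4) = 1/3
```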


Multinomial Naive Bayes: Decision Boundary
ŷ = argmax_c πc ∏_j (pjc)^{x^(j)}

ŷ = argmax_c log( πc ∏_j (pjc)^{x^(j)} )

log( πc ∏_j (pjc)^{x^(j)} ) = log πc + Σ_{j=1}^{n} x^(j) · log pjc
                            = w0c + wc^T x

where w0c = log πc and wjc = log pjc.

The decision boundary between classes k and m is obtained by equating

log πk + Σ_{j=1}^{n} x^(j) · log pjk = log πm + Σ_{j=1}^{n} x^(j) · log pjm
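
A minimal sketch of this linear form (all parameter values are hypothetical): each class score is an affine function w0c + wc^T x of the count vector, so the pairwise decision boundaries are hyperplanes.

```python
import numpy as np

pi = np.array([0.5, 0.5])                 # class priors pi_0, pi_1
P = np.array([[0.1, 0.4, 0.3, 0.2],       # p_j0 for the 4 dictionary words
              [0.4, 0.3, 0.1, 0.2]])      # p_j1
w0 = np.log(pi)                           # per-class bias w_{0c} = log pi_c
W = np.log(P)                             # per-class weights w_{jc} = log p_jc

x = np.array([3, 1, 0, 2])
scores = w0 + W @ x                       # log pi_c + sum_j x^(j) log p_jc
print(int(np.argmax(scores)))

# The boundary between classes k and m is the set of x with
# (w0[k] - w0[m]) + (W[k] - W[m]) @ x == 0, a hyperplane in count space.
```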
