
Chapter 26

Probabilistic Methods


26.1 Full Bayes Classifier


Given a training dataset $D$ with $n$ points in a $d$-dimensional space, and assuming that there are $k$ classes, the Bayes classifier makes use of Bayes' theorem to predict the class of a new test instance $x$. It tries to estimate the posterior probability $P(c_i \mid x)$ for each class $c_i$, and chooses the one with the largest probability.
Recall that Bayes' theorem allows us to express the posterior probability in terms of the likelihood and the prior probability, as follows:

$$P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)} \tag{26.1}$$

where $P(x)$ is given as follows:

$$P(x) = \sum_{j=1}^{k} P(x \mid c_j)\, P(c_j) \tag{26.2}$$

Let $D_i$ denote the subset of points in $D$ that are labeled with class $c_i$, i.e., $D_i = \{x_j \in D \mid x_j \text{ has label } y_j = c_i\}$. The prior probabilities can be estimated directly from the training data as follows:

$$P(c_i) = \frac{|D_i|}{|D|} \tag{26.3}$$
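As a quick illustration of this estimate, the following Python sketch computes the per-class priors by simple counting; the function name `estimate_priors` is our own illustrative choice, not from the text:

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(c_i) = |D_i| / |D| from a list of class labels (Eq. 26.3)."""
    n = len(labels)
    counts = Counter(labels)
    return {c: cnt / n for c, cnt in counts.items()}

# Example: the class labels of the six points in Example 26.1
print(estimate_priors(["L", "H", "L", "H", "H", "H"]))  # {'L': 0.333..., 'H': 0.666...}
```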
To estimate the likelihood $P(x \mid c_i)$, we have to estimate the joint probability of the values in all $d$ dimensions, $P(x_1, x_2, \ldots, x_d \mid c_i)$. Assuming all dimensions are numeric, we can estimate the joint probability by assuming that each class $c_i$ is normally distributed around some mean $\mu_i$, with a corresponding covariance matrix $\Sigma_i$. In a $d$-dimensional space, $\mu_i$ is a $d \times 1$ column vector and $\Sigma_i$ is a $d \times d$ matrix.
Both of these have to be estimated directly from the training data. Estimating the mean of class $c_i$ is straightforward:

$$\mu_i = \frac{\sum_{x_j \in D_i} x_j}{|D_i|} \tag{26.4}$$

The covariance matrix can also be estimated directly from $D_i$, the subset of the data with class $c_i$, as follows:

$$\Sigma_i = \begin{pmatrix} \sigma_{X_1 X_1} & \sigma_{X_1 X_2} & \cdots & \sigma_{X_1 X_d} \\ \sigma_{X_2 X_1} & \sigma_{X_2 X_2} & \cdots & \sigma_{X_2 X_d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{X_d X_1} & \sigma_{X_d X_2} & \cdots & \sigma_{X_d X_d} \end{pmatrix} \tag{26.5}$$

where $\sigma_{X_i X_j}$ is the covariance between dimensions $X_i$ and $X_j$, computed only from the points in $D_i$.
Once the class parameters have been estimated, we can use the multivariate normal density function to compute the likelihood:

$$P(x \mid c_i) = f(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} \sqrt{|\Sigma_i|}} \exp\left\{ -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right\} \tag{26.6}$$
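To make the estimation procedure concrete, here is a minimal sketch of a full Bayes classifier for numeric data using NumPy and SciPy; the class and method names are our own illustrative choices, not part of the text, and degenerate (singular) covariance matrices would need extra care in practice:

```python
import numpy as np
from scipy.stats import multivariate_normal

class FullBayes:
    """Full (multivariate Gaussian) Bayes classifier, per Eqs. (26.1)-(26.6)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.covs_ = {}, {}, {}
        for c in self.classes_:
            D_i = X[y == c]                       # subset D_i with label c
            self.priors_[c] = len(D_i) / len(X)   # P(c_i) = |D_i| / |D|   (26.3)
            self.means_[c] = D_i.mean(axis=0)     # mu_i                   (26.4)
            self.covs_[c] = np.cov(D_i, rowvar=False, bias=True)  # Sigma_i (26.5)
        return self

    def predict(self, x):
        # Choose the class maximizing P(x|c_i) P(c_i); P(x) is a common factor.
        scores = {c: multivariate_normal.pdf(x, self.means_[c], self.covs_[c])
                     * self.priors_[c]
                  for c in self.classes_}
        return max(scores, key=scores.get)
```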

If the attributes are categorical, the likelihood (the joint probability) can be estimated by computing the fraction of times the values $x = (x_1, x_2, \ldots, x_d)$ co-occur in $D_i$, which gives us:

$$P(x \mid c_i) = \frac{\text{number of times } (x_1, x_2, \ldots, x_d) \text{ occurs in } D_i}{|D_i|} \tag{26.7}$$
For both numeric and categorical attributes it is very expensive to evaluate the joint probability. For instance, for numeric attributes we have to estimate $O(d^2)$ covariances, and as the dimensionality increases this means estimating too many parameters, for which we may not have enough data. For categorical attributes we have to estimate the joint probability via the empirical frequency counts over all possible points that may occur, of which there are $\prod_i |dom(X_i)|$. Even if each categorical attribute has only two values, we would need to estimate probabilities for $2^d$ points, i.e., an exponential number. Again, we may not have enough data to estimate these joint probabilities directly via counting.

26.2 Naive Bayes Classifier


We saw above that the full Bayes approach is fraught with estimation-related problems, especially with a large number of dimensions. The naive Bayes approach makes the naive assumption that the attributes are all independent. This leads to a much simpler, though surprisingly effective, approach in practice. The independence assumption immediately implies
that the joint probability can be decomposed into a product of dimension-wise probabilities:

$$P(x \mid c_i) = P(x_1, x_2, \ldots, x_d \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i) \tag{26.8}$$

For numeric data, the naive assumption corresponds to setting all the off-diagonal covariances in $\Sigma_i$ to zero:

$$\Sigma_i = \begin{pmatrix} \sigma_{X_1 X_1} & 0 & \cdots & 0 \\ 0 & \sigma_{X_2 X_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{X_d X_d} \end{pmatrix} = \begin{pmatrix} \sigma^2_{X_1} & 0 & \cdots & 0 \\ 0 & \sigma^2_{X_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_{X_d} \end{pmatrix} \tag{26.9}$$

Let us plug this diagonal covariance matrix into (26.6). First note that

$$|\Sigma_i| = \det(\Sigma_i) = \sigma^2_{X_1}\, \sigma^2_{X_2} \cdots \sigma^2_{X_d} = \prod_{j=1}^{d} \sigma^2_{X_j} \tag{26.10}$$

Also, we have

$$\Sigma_i^{-1} = \begin{pmatrix} \frac{1}{\sigma^2_{X_1}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma^2_{X_2}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{\sigma^2_{X_d}} \end{pmatrix} \tag{26.11}$$

and thus

$$(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) = \sum_{j=1}^{d} \frac{(x_j - \mu_{ij})^2}{\sigma^2_{X_j}} \tag{26.12}$$
Plugging these into (26.6) gives us:

$$P(x \mid c_i) = \frac{1}{(2\pi)^{d/2} \sqrt{\prod_{j=1}^{d} \sigma^2_{X_j}}} \exp\left\{ -\sum_{j=1}^{d} \frac{(x_j - \mu_{ij})^2}{2\sigma^2_{X_j}} \right\} \tag{26.13}$$

$$= \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\, \sigma_{X_j}} \exp\left\{ -\frac{(x_j - \mu_{ij})^2}{2\sigma^2_{X_j}} \right\} \tag{26.14}$$

$$= \prod_{j=1}^{d} P(x_j \mid c_i) \tag{26.15}$$

In other words, the joint probability has been decomposed into a product of probabilities along each dimension, as required by the independence assumption. We now have only $d$ variances and $d$ means to estimate, for a total of only $2d$ parameters per class.
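A minimal Gaussian naive Bayes sketch along these lines, again with our own illustrative names, only needs the per-dimension means and variances:

```python
import numpy as np

class GaussianNaiveBayes:
    """Naive Bayes for numeric data: per-class, per-dimension means and variances (Eqs. 26.13-26.15)."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.vars_ = {}, {}, {}
        for c in self.classes_:
            D_i = X[y == c]
            self.priors_[c] = len(D_i) / len(X)
            self.means_[c] = D_i.mean(axis=0)   # d means per class
            self.vars_[c] = D_i.var(axis=0)     # d variances per class (2d parameters total)
        return self

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        scores = {}
        for c in self.classes_:
            mu, var = self.means_[c], self.vars_[c]
            # Guard against zero variance (cf. sigma_L = 0 in Example 26.1 below)
            var = np.maximum(var, 1e-12)
            # Product of univariate normal densities, one per dimension (Eq. 26.14)
            dens = np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
            scores[c] = dens * self.priors_[c]
        return max(scores, key=scores.get)
```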

For categorical data, the independence assumption leads to the following direct estimate of the probability per dimension:

$$P(x_j \mid c_i) = \frac{\text{number of times value } x_j \text{ occurs in } D_i}{|D_i|} \tag{26.16}$$
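A minimal sketch of this per-dimension frequency estimate (without smoothing; the function name is our own) might look as follows:

```python
def categorical_likelihood(value, attribute_values_in_class):
    """Estimate P(x_j | c_i) as the fraction of points in D_i with this value (Eq. 26.16)."""
    n_i = len(attribute_values_in_class)
    count = sum(1 for v in attribute_values_in_class if v == value)
    return count / n_i

# Car values for class H in Example 26.1 below (points 2, 4, 5, 6)
print(categorical_likelihood("suv", ["vintage", "suv", "sports", "suv"]))  # 2/4 = 0.5
```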

Example 26.1 (Naive Bayes): Let us consider the dataset shown below.

Id  Age  Car      Class
1   25   sports   L
2   20   vintage  H
3   25   sports   L
4   45   suv      H
5   20   sports   H
6   25   suv      H
Assume that we need to classify the new point (Age: 23, Car: truck). Since each attribute is assumed to be independent of the others, we consider them separately. That is,

$$P((23, \text{truck}) \mid H) = P(23 \mid H) \cdot P(\text{truck} \mid H)$$

and likewise,

$$P((23, \text{truck}) \mid L) = P(23 \mid L) \cdot P(\text{truck} \mid L)$$
For Age, we have $D_L = \{1, 3\}$ and $D_H = \{2, 4, 5, 6\}$. We can estimate the mean and standard deviation from these labeled subsets, as shown below:

$$\mu_H = \frac{20 + 45 + 20 + 25}{4} = \frac{110}{4} = 27.5 \qquad \mu_L = \frac{25 + 25}{2} = 25$$

$$\sigma_H = \sqrt{\frac{425}{4}} = 10.31 \qquad \sigma_L = \sqrt{\frac{0}{2}} = 0$$

Using the univariate normal distribution, we obtain $P(23 \mid H) = N(23 \mid \mu_H = 27.5, \sigma_H = 10.31) = 0.035$, and $P(23 \mid L) = N(23 \mid \mu_L = 25, \sigma_L = 0) = 0$. Note that due to the limited data we obtain $\sigma_L = 0$, which leads to a zero likelihood of 23 coming from class L.
For Car, which is categorical, we immediately run into a problem, since the value truck does not appear in the training set. We could take $P(\text{truck} \mid H)$ and $P(\text{truck} \mid L)$ to be zero. However, we would like every value in the domain of the attribute to have some small probability of being observed. One simple way of obtaining non-zero probabilities is the Laplace correction, i.e., adding a count of one to the observed count of each value for each class, as shown below:

$$P(\text{sports} \mid H) = \frac{1 + 1}{4 + 4} = \frac{2}{8} \qquad P(\text{sports} \mid L) = \frac{2 + 1}{2 + 4} = \frac{3}{6}$$

$$P(\text{vintage} \mid H) = \frac{1 + 1}{4 + 4} = \frac{2}{8} \qquad P(\text{vintage} \mid L) = \frac{0 + 1}{2 + 4} = \frac{1}{6}$$

$$P(\text{suv} \mid H) = \frac{2 + 1}{4 + 4} = \frac{3}{8} \qquad P(\text{suv} \mid L) = \frac{0 + 1}{2 + 4} = \frac{1}{6}$$

$$P(\text{truck} \mid H) = \frac{0 + 1}{4 + 4} = \frac{1}{8} \qquad P(\text{truck} \mid L) = \frac{0 + 1}{2 + 4} = \frac{1}{6}$$


Note that all counts are adjusted by (+1) to obtain a non-zero probability in every case. Also, assuming that the domain of Car consists of only the four values {sports, vintage, suv, truck}, we also need to increment each denominator by $|dom(\text{Car})| = 4$. In other words, the probability of a given value $v$ is computed as:

$$P(v \mid c_i) = \frac{n_v + 1}{|D_i| + |dom(X_j)|} \tag{26.17}$$

where $n_v$ is the observed count of value $v$ for attribute $X_j$ among the points in $D_i$. Note that instead of the Laplace correction, if we have a prior probability estimate $P_v$ for each value, we can use that as well:

$$P(v \mid c_i) = \frac{n_v + P_v}{|D_i| + \sum_v P_v} \tag{26.18}$$
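As a small illustration of Eq. (26.17), here is one way the Laplace-corrected estimate could be coded; this is a sketch, and the function and argument names are ours:

```python
def laplace_likelihood(value, attribute_values_in_class, domain):
    """Estimate P(v | c_i) with a Laplace correction of +1 per domain value (Eq. 26.17)."""
    n_v = sum(1 for x in attribute_values_in_class if x == value)
    return (n_v + 1) / (len(attribute_values_in_class) + len(domain))

# Car values for class H in Example 26.1, with domain {sports, vintage, suv, truck}
car_H = ["vintage", "suv", "sports", "suv"]
domain = ["sports", "vintage", "suv", "truck"]
print(laplace_likelihood("truck", car_H, domain))  # (0 + 1) / (4 + 4) = 0.125
```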
Using the above probabilities, we finally obtain

$$P((23, \text{truck}) \mid H) = P(23 \mid H) \cdot P(\text{truck} \mid H) = 0.035 \times \tfrac{1}{8} = 0.0044$$

$$P((23, \text{truck}) \mid L) = P(23 \mid L) \cdot P(\text{truck} \mid L) = 0 \times \tfrac{1}{6} = 0$$

We next compute

$$P(23, \text{truck}) = P((23, \text{truck}) \mid H)\, P(H) + P((23, \text{truck}) \mid L)\, P(L) = 0.0044 \times \tfrac{4}{6} + 0 \times \tfrac{2}{6} = 0.003$$
We then obtain the posterior probabilities as follows:

$$P(H \mid (23, \text{truck})) = \frac{P((23, \text{truck}) \mid H)\, P(H)}{P(23, \text{truck})} = \frac{0.0044 \times \tfrac{4}{6}}{0.003} = 1$$

and

$$P(L \mid (23, \text{truck})) = \frac{P((23, \text{truck}) \mid L)\, P(L)}{P(23, \text{truck})} = \frac{0 \times \tfrac{2}{6}}{0.003} = 0$$

Thus we classify (23, truck) as high risk (H).
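To tie the example together, the short Python sketch below reproduces these numbers end to end; the variable names and the `norm_pdf` helper are ours, and any small discrepancies come only from the rounding used in the text:

```python
import math

def norm_pdf(x, mu, sigma):
    """Univariate normal density N(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Age values per class from Example 26.1
age_H = [20, 45, 20, 25]
mu_H = sum(age_H) / len(age_H)                                           # 27.5
sigma_H = math.sqrt(sum((a - mu_H) ** 2 for a in age_H) / len(age_H))    # 10.31

# Likelihoods for the test point (Age 23, Car truck)
p_age_H = norm_pdf(23, mu_H, sigma_H)        # ~0.035
p_age_L = 0.0                                # sigma_L = 0 forces a zero likelihood
p_car_H, p_car_L = 1 / 8, 1 / 6              # Laplace-corrected estimates from the table

like_H = p_age_H * p_car_H                   # ~0.0044
like_L = p_age_L * p_car_L                   # 0
prior_H, prior_L = 4 / 6, 2 / 6
evidence = like_H * prior_H + like_L * prior_L   # ~0.003

print(like_H * prior_H / evidence)           # posterior P(H | x) = 1.0
print(like_L * prior_L / evidence)           # posterior P(L | x) = 0.0
```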
