Bayes and Naive Bayes Predictors
Let us assume that we have the task of designing a Bayesian machine that can predict whether
a person has meningitis based on a set of three features (symptoms). The available historical
data is given in Table 1.
Table 1: Simple dataset for MENINGITIS diagnosis with descriptive features HEADACHE,
FEVER and VOMITING
ID HEADACHE FEVER VOMITING MENINGITIS
1 TRUE TRUE FALSE FALSE
2 FALSE TRUE FALSE FALSE
3 TRUE FALSE TRUE FALSE
4 TRUE FALSE TRUE FALSE
5 FALSE TRUE FALSE TRUE
6 TRUE FALSE TRUE FALSE
7 TRUE FALSE TRUE FALSE
8 TRUE FALSE TRUE TRUE
9 FALSE TRUE FALSE FALSE
10 TRUE FALSE TRUE TRUE
Let us denote the symptoms (features) simply by the following variables: h = HEADACHE,
f = FEVER, v = VOMITING, and the disease (target feature) by m = MENINGITIS. In this
example, all features are boolean, taking the values TRUE or FALSE.
Let us revisit Bayes' Theorem and relate it to this example. Given evidence A and an
outcome B, we can write the posterior probability of the outcome as
P(B|A) = (P(A|B) × P(B)) / P(A)    (1)
If we have several pieces of evidence, as in this example, Bayes' Theorem is written as
P(t = l | q[1], ..., q[m]) = (P(q[1], ..., q[m] | t = l) × P(t = l)) / P(q[1], ..., q[m])    (2)
In Equation 2, t is the target feature and l is a value it can take; q[1], ..., q[m] are the descriptive
features (the evidence, or in our case the symptoms). Furthermore, P(t = l) is the prior probability
of the target feature taking on a specific value, in our case m = TRUE or FALSE;
P(q[1], ..., q[m]) is the joint probability of the features (symptoms) having some specific set
of values; and P(q[1], ..., q[m] | t = l) is the conditional probability of the symptoms taking
some specific values given that the target feature took on the value l.
Based on the data in Table 1, let us compute some probabilities:
• P(m = TRUE) = P(m) = 3/10 = 0.3; computed by counting the number of times
  MENINGITIS has the value TRUE out of the ten rows.
• P(m = FALSE) = P(¬m) = 7/10 = 0.7; again, computed by counting the number of
  times MENINGITIS has the value FALSE out of the ten rows.
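These counts are easy to reproduce in code. Below is a minimal Python sketch, under the
assumption that Table 1 is stored as a list of boolean tuples; the names data and MENINGITIS
are my own illustrative choices, not part of the note.

T, F = True, False
# Table 1, one tuple per row: (HEADACHE, FEVER, VOMITING, MENINGITIS)
data = [(T,T,F,F), (F,T,F,F), (T,F,T,F), (T,F,T,F), (F,T,F,T),
        (T,F,T,F), (T,F,T,F), (T,F,T,T), (F,T,F,F), (T,F,T,T)]
MENINGITIS = 3                                                # column index of the target feature

p_m = sum(1 for row in data if row[MENINGITIS]) / len(data)   # P(m = TRUE)
p_not_m = 1 - p_m                                             # P(m = FALSE)
print(p_m, p_not_m)                                           # prints 0.3 0.7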
There is one more consideration before we start to use our Bayesian predictor. In order to
compute the conditional probability of the evidence given the target, P(q[1], ..., q[m] | t = l),
we could use the dataset directly or factorise the probability to make the computation easier.
We can use the Chain Rule to factorise and write

P(q[1], ..., q[m] | t = l) = P(q[1] | t = l) × P(q[2] | q[1], t = l) × ... × P(q[m] | q[1], ..., q[m−1], t = l)    (3)
This factorisation turns the probability of a set of features conditioned on the target feature,
into a product of probabilities of each feature conditioned on a set of other features and the
target feature.
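Each factor in this product can be estimated from Table 1 by counting. The sketch below shows
one way to do this; the helper cond_prob and the column-index names are my own, not from
the note.

T, F = True, False
data = [(T,T,F,F), (F,T,F,F), (T,F,T,F), (T,F,T,F), (F,T,F,T),
        (T,F,T,F), (T,F,T,F), (T,F,T,T), (F,T,F,F), (T,F,T,T)]
HEADACHE, FEVER, VOMITING, MENINGITIS = 0, 1, 2, 3    # column indices

def cond_prob(event, given):
    """P(event | given): both arguments are predicates over a row of Table 1."""
    matching = [row for row in data if given(row)]
    return sum(1 for row in matching if event(row)) / len(matching)

# One Chain Rule factor, P(not f | h, not m); counting gives 4/5 = 0.8
print(cond_prob(lambda r: not r[FEVER],
                lambda r: r[HEADACHE] and not r[MENINGITIS]))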
Suppose a patient is presented to the doctor with HEADACHE = TRUE, FEVER =
FALSE and VOMITING = TRUE. What will our Bayesian predictor advise?
Our solution strategy is to compute the posterior probabilities P(m | h, ¬f, v) and P(¬m | h, ¬f, v),
and select the higher of the two as our prediction. Applying Bayes' Theorem (Equation 2) to
the two cases gives

P(m | h, ¬f, v) = (P(h, ¬f, v | m) × P(m)) / P(h, ¬f, v)    (4)

P(¬m | h, ¬f, v) = (P(h, ¬f, v | ¬m) × P(¬m)) / P(h, ¬f, v)    (5)

Equation 4 tells us the probability of having meningitis given the evidence of the symptoms,
and Equation 5 tells us the probability of not having it given the same evidence.
We have previously computed P(m) = 0.3 and P(¬m) = 0.7 by counting on the dataset.
We can also confirm, from the table, that P(h, ¬f, v) = 6/10 = 0.6. The conditional probabilities
can be computed by counting or by using the Chain Rule. Both are easily computed in
this simple example.
P(h, ¬f, v | ¬m) = P(h | ¬m) × P(¬f | h, ¬m) × P(v | h, ¬f, ¬m)
                 = 5/7 × 4/5 × 4/4 ≈ 0.7143 × 0.8 × 1.0 ≈ 0.5714
Similarly, P(h, ¬f, v | m) = P(h | m) × P(¬f | h, m) × P(v | h, ¬f, m) = 2/3 × 2/2 × 2/2 ≈ 0.6667.
Hence,

P(¬m | h, ¬f, v) = (P(h, ¬f, v | ¬m) × P(¬m)) / P(h, ¬f, v) = (0.5714 × 0.7) / 0.6 ≈ 0.6667

and

P(m | h, ¬f, v) = (P(h, ¬f, v | m) × P(m)) / P(h, ¬f, v) = (0.6667 × 0.3) / 0.6 ≈ 0.3333.
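As a sanity check, the short Python sketch below recomputes both posteriors directly from
Table 1, counting the joint events rather than using the Chain Rule; as before, the data list
and the names used are illustrative assumptions.

T, F = True, False
data = [(T,T,F,F), (F,T,F,F), (T,F,T,F), (T,F,T,F), (F,T,F,T),
        (T,F,T,F), (T,F,T,F), (T,F,T,T), (F,T,F,F), (T,F,T,T)]
HEADACHE, FEVER, VOMITING, MENINGITIS = 0, 1, 2, 3

def evidence(row):                                    # the query: h, not f, v
    return row[HEADACHE] and not row[FEVER] and row[VOMITING]

n = len(data)
p_evidence = sum(1 for r in data if evidence(r)) / n  # P(h, not f, v) = 0.6

for label in (True, False):                           # m = TRUE, then m = FALSE
    rows_l = [r for r in data if r[MENINGITIS] == label]
    prior = len(rows_l) / n                                            # P(m = label)
    likelihood = sum(1 for r in rows_l if evidence(r)) / len(rows_l)   # P(h, not f, v | m = label)
    print(label, round(likelihood * prior / p_evidence, 4))            # 0.3333 then 0.6667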
Based on the given evidence, our Bayesian predictor advises that the patient does not have
meningitis, because P(¬m | h, ¬f, v) > P(m | h, ¬f, v). We have chosen the maximum posterior
probability. This type of predictor is called a maximum a posteriori (MAP) predictor.
This MAP predictor is written as

l∗ = arg max_l P(t = l | q[1], ..., q[m])    (6)
Equation 6 says that our best prediction, l∗, is the value of the target feature
(in our example MENINGITIS = TRUE or FALSE) that gives us the maximum posterior
probability. This is the solution strategy we have followed in this example, and it is a very
powerful method.
You have probably noticed that, in the example we just concluded, the denominator
P(h, ¬f, v) was the same for both P(m | h, ¬f, v) and P(¬m | h, ¬f, v). Hence we could have
omitted calculating it and just compared the values of the numerators. If we do this, our MAP
formula becomes

l∗ = arg max_l P(q[1], ..., q[m] | t = l) × P(t = l)
There are several issues that could come up as we use Bayes' Theorem in this way.
2. There could also be a situation where a probability is undefined. For example, in our
Table 1, P(¬v | h, f, m) is undefined because there is no row where h, f and m are all TRUE
simultaneously. Hence we will have a divide by zero, resulting in an undefined probability.
The way around these issues is to make a conditional independence assumption. Two events
A and B are conditionally independent given knowledge of a third event C if

P(A | B, C) = P(A | C)
P(A, B | C) = P(A | C) × P(B | C)

Let us assume that the event of the target feature taking a specific value causes the
assignment of values to the descriptive features, q[1], ..., q[m]. Then the events of each
descriptive feature taking a value are conditionally independent of each other given
the value of the target feature.
So, based on conditional independence, the Chain Rule factorisation can be written as

P(q[1], ..., q[m] | t = l) = P(q[1] | t = l) × P(q[2] | t = l) × ... × P(q[m] | t = l)
Observe that this simplification reduces the number of probabilities we need to compute.
Recompute the posterior probabilities for the case (HEADACHE = TRUE, FEVER =
FALSE, VOMITING = TRUE) that we did earlier, but assume conditional independence.
Are the posterior probabilities different? Were you able to reach the same prediction
(MENINGITIS = FALSE) using MAP?
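If you want to check your working, the sketch below computes the naive Bayes scores for this
query under the conditional independence assumption and normalises them so that they sum
to one; the data list and variable names are, as before, my own illustrative assumptions.

T, F = True, False
data = [(T,T,F,F), (F,T,F,F), (T,F,T,F), (T,F,T,F), (F,T,F,T),
        (T,F,T,F), (T,F,T,F), (T,F,T,T), (F,T,F,F), (T,F,T,T)]
HEADACHE, FEVER, VOMITING, MENINGITIS = 0, 1, 2, 3
query = {HEADACHE: True, FEVER: False, VOMITING: True}    # h, not f, v

scores = {}
for label in (True, False):
    rows_l = [r for r in data if r[MENINGITIS] == label]
    score = len(rows_l) / len(data)                       # prior P(m = label)
    for col, value in query.items():                      # product of P(q[i] | m = label)
        score *= sum(1 for r in rows_l if r[col] == value) / len(rows_l)
    scores[label] = score

total = sum(scores.values())                              # normalise the two scores
for label in (True, False):
    print(label, round(scores[label] / total, 4))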
If you notice any error in this note, do not hesitate to write to me.