Lecture 8 - Naive Bayes
Module 1: AI Fundamentals
Lecture 8: Naïve Bayes Classifier
Bayesian Classification
• Bayesian classifiers are statistical classifiers
• They can predict class membership probabilities, such as
the probability that a given tuple belongs to a particular
class
• Bayesian classification is based on Bayes’ theorem
• Naïve Bayesian classifiers assume that the effect of an
attribute value on a given class is independent of the values
of the other attributes.
• This assumption is called class conditional independence.
Bayes Theorem
• Let X be a data tuple. In Bayesian terms, X is considered
“evidence”. X is described by measurements made on a set of
n attributes.
• Let H be some hypothesis, such as that the data tuple X
belongs to a specified class C
• For classification problems, we want to determine P(H|X), the
probability that hypothesis H holds given the “evidence” or
observed data tuple X.
• In other words, we are looking for the probability that tuple X
belongs to class C, given that we know the attribute
description of X.
Bayes Theorem
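In the H (hypothesis) and X (evidence) notation introduced above, Bayes’ theorem states:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$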
Bayes Theorem
• P(H | X) is the posterior probability, or a posteriori probability,
of H conditioned on X
• For example, suppose our world of data tuples is confined to
customers described by the attributes age and income,
respectively, and that X is a 35-year-old customer with an
income of $40,000
• Suppose that H is the hypothesis that our customer will buy a
computer.
• Then P(H | X) reflects the probability that customer X will buy
a computer given that we know the customer’s age and
income.
Bayes Theorem
• In contrast, P(H) is the prior probability, or a priori probability,
of H
• For our example, this is the probability that any given
customer will buy a computer, regardless of age, income, or
any other information
• The posterior probability, P(H | X), is based on more
information (e.g., customer information) than the prior
probability, P(H), which is independent of X.
Bayes Theorem
• Similarly, P(X | H) is the posterior probability of X conditioned
on H.
• That is, it is the probability that a customer, X, is 35 years old
and earns $40,000, given that we know the customer will buy
a computer.
Bayes Theorem
• How are these probabilities estimated?
• P(H), P(X | H), and P(X) may be estimated from the given data
• Bayes’ theorem is useful in that it provides a way of calculating
the posterior probability, P(H | X), from P(H),
P(X | H), and P(X)
Naïve Bayesian Classification
• Step 1. Let D be a training set of tuples and their associated
class labels.
• As usual, each tuple is represented by an n-dimensional
attribute vector, X = (x1, x2, … , xn), depicting n
measurements made on the tuple from n attributes,
respectively, A1, A2, … , An.
• By Bayes’ theorem, the classifier can compute, for each class Ci, the posterior probability
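$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
and it predicts that X belongs to the class with the highest such posterior.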
Naïve Bayesian Classification
• Step 3. As P(X) is constant for all classes, only P(X |Ci)P(Ci)
need be maximized.
• If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely, that is,
P(C1) = P(C2) = … = P(Cm), and we would therefore maximize
P(X | Ci).
• Otherwise, we maximize P(X | Ci)P(Ci).
Naïve Bayesian Classification
• Step 4. Given data sets with many attributes, it would be
extremely computationally expensive to compute P(X | Ci). In
order to reduce computation in evaluating P(X |Ci), the naive
assumption of class conditional independence is made.
Naïve Bayesian Classification
• Thus,
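the class-conditional probability factors into a product of per-attribute probabilities (the standard naive Bayes factorization, where xk denotes the value of attribute Ak in X):
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$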
Dataset
Naïve Bayes Classifier
• The data tuples are described by the attributes age,
income, student, and credit rating. The class label
attribute, buys computer, has two distinct values
(namely, {yes, no}).
• Let C1 correspond to the class buys computer = yes
and C2 correspond to buys computer = no, and let X be
the tuple we wish to classify.
Naïve Bayes Classifier
• We need to maximize P(X | Ci)P(Ci), for i = 1, 2. P(Ci),
the prior probability of each class, can be computed
based on the training tuples:
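The standard estimate is the relative frequency of each class in the training set:
$$P(C_i) = \frac{|C_{i,D}|}{|D|}$$
where |Ci,D| is the number of training tuples of class Ci in D and |D| is the total number of training tuples.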
Naïve Bayes Classifier
• To compute P(X | Ci), for i = 1, 2, we compute the
conditional probabilities P(xk | Ci) for each attribute value xk of X.
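For a categorical attribute Ak, each such probability is estimated as a fraction of the class-Ci training tuples:
$$P(x_k \mid C_i) = \frac{\#\{\text{tuples of class } C_i \text{ with } A_k = x_k\}}{\#\{\text{tuples of class } C_i\}}$$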
Naïve Bayes Classifier
• Similarly, the same conditional probabilities are computed for class ‘no’.
Naïve Bayes Classifier
• To find the class, Ci, that maximizes P(X | Ci)P(Ci), we multiply each class prior by its product of conditional probabilities and predict the class with the larger value.
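A minimal Python sketch of the whole procedure, assuming a small hypothetical training set in the style of the buys_computer example; the tuples, attribute values, and the query tuple below are illustrative assumptions, not the slide’s actual data:

```python
from collections import Counter, defaultdict

# Hypothetical training data in the style of the buys_computer example.
# Each entry: ((age, income, student, credit_rating), class label).
train = [
    (("youth", "high", "no", "fair"), "no"),
    (("youth", "high", "no", "excellent"), "no"),
    (("middle_aged", "high", "no", "fair"), "yes"),
    (("senior", "medium", "no", "fair"), "yes"),
    (("senior", "low", "yes", "fair"), "yes"),
    (("senior", "low", "yes", "excellent"), "no"),
    (("middle_aged", "low", "yes", "excellent"), "yes"),
    (("youth", "medium", "no", "fair"), "no"),
    (("youth", "low", "yes", "fair"), "yes"),
    (("senior", "medium", "yes", "fair"), "yes"),
]

# Priors: P(Ci) = (# tuples of class Ci) / (# tuples in D)
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Conditional counts: for each (class, attribute index), count attribute values.
cond_counts = defaultdict(Counter)
for x, label in train:
    for k, value in enumerate(x):
        cond_counts[(label, k)][value] += 1

def cond_prob(value, k, c):
    """P(xk = value | Ci = c), estimated as a relative frequency."""
    return cond_counts[(c, k)][value] / class_counts[c]

def classify(x):
    """Pick the class Ci maximizing P(X | Ci) * P(Ci) under class conditional independence."""
    scores = {}
    for c in class_counts:
        score = priors[c]
        for k, value in enumerate(x):
            score *= cond_prob(value, k, c)  # product of per-attribute probabilities
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical unseen tuple to classify.
X = ("youth", "medium", "yes", "fair")
predicted, scores = classify(X)
print(predicted, scores)
```

In practice a Laplacian correction (adding one to each count) is commonly used so that an attribute value never seen with a class does not force the whole product to zero.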
For continuous-valued attributes
• A continuous-valued attribute is typically assumed to have
a Gaussian distribution with a mean μ and standard
deviation σ, defined by
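the usual one-dimensional normal density
$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$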
• So that
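$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
i.e., the observed value xk is plugged into the Gaussian fitted to class Ci.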
For continuous-valued attributes
• We need to compute μCi and σCi, which are the mean
(i.e., average) and standard deviation, respectively, of the
values of attribute Ak for training tuples of class Ci
For continuous-valued attributes
• For example, let X = (35, $40,000), where A1 and A2 are
the attributes age and income, respectively. Let the class
label attribute be buys computer.
• Let’s suppose that age has not been discretized and
therefore exists as a continuous-valued attribute.
• Suppose that from the training set, we find that customers
in D who buy a computer are 38±12 years of age
• In other words, for attribute age and this class, we have μ =
38 years and σ = 12.
For continuous-valued attributes
• We can plug these quantities, along with x1 = 35 for our
tuple X, into the Gaussian distribution equation in order to
estimate P(age = 35 | buys computer = yes)
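With μ = 38 and σ = 12, this gives approximately
$$P(\text{age} = 35 \mid \text{buys computer} = \text{yes}) = g(35, 38, 12) = \frac{1}{12\sqrt{2\pi}}\, e^{-\frac{(35-38)^2}{2 \cdot 12^2}} \approx 0.032$$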
Happy Learning!