
High Impact Skills Development Program

in Artificial Intelligence, Data Science, and Blockchain

Module 1: AI Fundamentals
Lecture 8: Naïve Bayes Classifier

Instructor: Dr. Syed Imran Ali
Assistant Professor, SEECS, NUST

Courtesy: Dr. Faisal Shafait and Dr. Adnan ul Hasan


Supervised Learning
- Regression
- Classification

Bayesian Classification
• Bayesian classifiers are statistical classifiers
• They can predict class membership probabilities, such as
the probability that a given tuple belongs to a particular
class
• Bayesian classification is based on Bayes’ theorem
• Naïve Bayesian classifiers assume that the effect of an
attribute value on a given class is independent of the values
of the other attributes.
• This assumption is called class conditional independence.
Bayes Theorem
• Let X be a data tuple. In Bayesian terms, X is considered
“evidence”. X is described by measurements made on a set of
n attributes.
• Let H be some hypothesis, such as that the data tuple X
belongs to a specified class C
• For classification problems, we want to determine P(H|X), the
probability that hypothesis H holds given the “evidence” or
observed data tuple X.
• In other words, we are looking for the probability that tuple X
belongs to class C, given that we know the attribute
description of X.
Bayes Theorem

P(H | X) = P(X | H) P(H) / P(X)

Bayes Theorem
• P(H | X) is the posterior probability, or a posteriori probability,
of H conditioned on X
• For example, suppose our world of data tuples is confined to
customers described by the attributes age and income,
respectively, and that X is a 35-year-old customer with an
income of $40,000
• Suppose that H is the hypothesis that our customer will buy a
computer.
• Then P(H | X) reflects the probability that customer X will buy
a computer given that we know the customer’s age and
income.
Bayes Theorem
• In contrast, P(H) is the prior probability, or a priori probability,
of H
• For our example, this is the probability that any given
customer will buy a computer, regardless of age, income, or
any other information
• The posterior probability, P(H | X), is based on more
information (e.g., customer information) than the prior
probability, P(H), which is independent of X.

Bayes Theorem
• Similarly, P(X | H) is the posterior probability of X conditioned
on H.
• That is, it is the probability that a customer, X, is 35 years old
and earns $40,000, given that we know the customer will buy
a computer.

• P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000

Bayes Theorem
• How are these probabilities estimated?

• P(H), P(X | H), and P(X) may be estimated from the given data
• Bayes’ theorem is useful in that it provides a way of calculating
the posterior probability, P(H | X), from P(H),
P(X | H), and P(X)

• Now we will look at how Bayes’ theorem is used in the Naive Bayesian classifier

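As a quick illustration of the theorem itself, here is a minimal Python sketch of this calculation for the buys-computer example. The three probabilities below are made-up numbers chosen for illustration only; they are not values estimated from the lecture's data.

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    # Bayes' theorem: P(H | X) = P(X | H) * P(H) / P(X)
    return likelihood_x_given_h * prior_h / evidence_x

p_h = 0.3           # P(H): hypothetical prior that a customer buys a computer
p_x_given_h = 0.1   # P(X | H): hypothetical probability of age 35 and $40,000 income among buyers
p_x = 0.05          # P(X): hypothetical probability of age 35 and $40,000 income overall

print(posterior(p_h, p_x_given_h, p_x))   # P(H | X) = 0.6
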
Naïve Bayesian Classification
• Step 1. Let D be a training set of tuples and their associated
class labels.
• As usual, each tuple is represented by an n-dimensional
attribute vector, X = (x1, x2, … , xn), depicting n
measurements made on the tuple from n attributes,
respectively, A1, A2, … , An.

• Step 2. Suppose that there are m classes, C1, C2, … , Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.

Naïve Bayesian Classification
• That is, the naïve Bayesian classifier predicts that tuple X
belongs to the class Ci if and only if

P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ m, j ≠ i

• Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum posteriori hypothesis

• By Bayes’ theorem,

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

Naïve Bayesian Classification
• Step 3. As P(X) is constant for all classes, only P(X |Ci)P(Ci)
need be maximized.
• If the class prior probabilities are not known, then it is
commonly assumed that the classes are equally likely, that is,
P(C1) = P(C2) = … = P(Cm), and we would therefore maximize
P(X | Ci).
• Otherwise, we maximize P(X | Ci)P(Ci).

Naïve Bayesian Classification
• Step 4. Given data sets with many attributes, it would be
extremely computationally expensive to compute P(X | Ci). In
order to reduce computation in evaluating P(X |Ci), the naive
assumption of class conditional independence is made.

• This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes).

Naïve Bayesian Classification

• Thus,

P(X | Ci) = P(x1 | Ci) x P(x2 | Ci) x … x P(xn | Ci)

• We can easily estimate the probabilities P(x1 | Ci), P(x2 | Ci), … , P(xn | Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X.

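Putting Steps 1 to 4 together, the sketch below is a minimal categorical naïve Bayes classifier in Python that estimates P(Ci) and each P(xk | Ci) by simple counting over the training set; the function and variable names are illustrative, not part of the lecture. Note that no smoothing is applied, so an attribute value never seen with a class contributes a zero probability for that class.

from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    # Estimate P(Ci) and P(xk | Ci) by counting over the training set D.
    n = len(labels)
    class_counts = Counter(labels)                       # |Ci,D| for each class
    priors = {c: class_counts[c] / n for c in class_counts}

    # cond_counts[(k, value, c)] = number of class-c tuples with this value for attribute Ak
    cond_counts = defaultdict(int)
    for x, c in zip(tuples, labels):
        for k, value in enumerate(x):
            cond_counts[(k, value, c)] += 1

    def likelihood(k, value, c):
        return cond_counts[(k, value, c)] / class_counts[c]   # P(xk | Ci)

    return priors, likelihood

def predict(x, priors, likelihood):
    # Choose the class Ci that maximizes P(X | Ci) * P(Ci).
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, value in enumerate(x):
            score *= likelihood(k, value, c)             # naive independence assumption
        if score > best_score:
            best_class, best_score = c, score
    return best_class
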
Dataset

[Training data table: 14 tuples described by the attributes age, income, student, and credit rating, each labelled with the class attribute buys computer (9 tuples with buys computer = yes, 5 with buys computer = no).]

Naïve Bayes Classifier
• The data tuples are described by the attributes age,
income, student, and credit rating. The class label
attribute, buys computer, has two distinct values
(namely, {yes, no}).
• Let C1 correspond to the class buys computer = yes
and C2 correspond to buys computer = no. The tuple
we wish to classify is:

• X = (age = youth, income = medium, student = yes, credit rating = fair)

Naïve Bayes Classifier
• We need to maximize P(X | Ci)P(Ci), for i = 1, 2. P(Ci),
the prior probability of each class, can be computed
based on the training tuples:

• P(buys computer = yes) = 9/14 = 0.643
• P(buys computer = no) = 5/14 = 0.357

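These priors come directly from the class counts; a quick check in Python, using the counts from the dataset above:

class_counts = {"yes": 9, "no": 5}                        # buys computer = yes / no in D
total = sum(class_counts.values())                        # 14 training tuples
priors = {c: n / total for c, n in class_counts.items()}
print(priors)                                             # {'yes': 0.643, 'no': 0.357} (rounded)
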
Naïve Bayes Classifier
• To compute P(X | Ci), for i = 1, 2, we compute the following
conditional probabilities:

• P(age = youth | buys computer = yes) = 2/9 = 0.222
• P(age = youth | buys computer = no) = 3/5 = 0.600
• P(income = medium | buys computer = yes) = 4/9 = 0.444
• P(income = medium | buys computer = no) = 2/5 = 0.400
• P(student = yes | buys computer = yes) = 6/9 = 0.667
• P(student = yes | buys computer = no) = 1/5 = 0.200
• P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
• P(credit rating = fair | buys computer = no) = 2/5 = 0.400

Naïve Bayes Classifier
• Using the computed probabilities, we obtain for class ‘yes’:

• P(X | buys computer = yes) = P(age = youth | buys computer = yes) x
P(income = medium | buys computer = yes) x
P(student = yes | buys computer = yes) x
P(credit rating = fair | buys computer = yes)
= 0.222 x 0.444 x 0.667 x 0.667 = 0.044

Naïve Bayes Classifier
• Similarly, using the computed probabilities, we obtain for class ‘no’:

• P(X | buys computer = no) = P(age = youth | buys computer = no) x
P(income = medium | buys computer = no) x
P(student = yes | buys computer = no) x
P(credit rating = fair | buys computer = no)

• P(X | buys computer = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

Naïve Bayes Classifier
• To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute

• P(X | buys computer = yes) P(buys computer = yes) = 0.044 x 0.643 = 0.028

• P(X | buys computer = no) P(buys computer = no) = 0.019 x 0.357 = 0.007

• Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.

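The whole worked example can be reproduced in a few lines of Python. The probabilities below are exactly those computed on the preceding slides; only the variable names are illustrative.

# Priors and class-conditional probabilities estimated from the training data
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

x = ["age=youth", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    score = priors[c]
    for value in x:
        score *= cond[c][value]            # naive independence assumption
    scores[c] = score

print(scores)                              # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))         # 'yes'
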
For continuous-valued attributes
• For each attribute, we look at whether the attribute is
categorical or continuous-valued. For instance, to compute
P(X | Ci), we consider the following:
• If Ak is categorical, then P(xk | Ci) is the number of tuples
of class Ci in D having the value xk for Ak, divided by
|Ci,D|, the number of tuples of class Ci in D

• If Ak is continuous-valued, then we need to do a bit more work.

For continuous-valued attributes
• A continuous-valued attribute is typically assumed to have
a Gaussian distribution with mean μ and standard
deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²))

• So that

P(xk | Ci) = g(xk, μCi, σCi)

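A minimal Python version of this Gaussian density, under the same assumption (the function name is illustrative):

import math

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
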
For continuous-valued attributes
• We need to compute μCi and σCi, which are the mean
(i.e., average) and standard deviation, respectively, of the
values of attribute Ak for training tuples of class Ci

• We then plug these two quantities into the Gaussian distribution equation, together with xk, in order to estimate P(xk | Ci)

For continuous-valued attributes
• For example, let X = (35, $40,000), where A1 and A2 are
the attributes age and income, respectively. Let the class
label attribute be buys computer.
• Let’s suppose that age has not been discretized and
therefore exists as a continuous-valued attribute.
• Suppose that from the training set, we find that customers
in D who buy a computer are 38±12 years of age
• In other words, for attribute age and this class, we have μ = 38 years and σ = 12.

For continuous-valued attributes
• We can plug these quantities, along with x1 = 35 for our tuple X, into the Gaussian distribution equation in order to estimate P(age = 35 | buys computer = yes)

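Carrying out that calculation with the gaussian() sketch from above (μ = 38, σ = 12, and x1 = 35 come from the slides; the numeric result is my own arithmetic):

import math

def gaussian(x, mu, sigma):
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# P(age = 35 | buys computer = yes) with mu = 38, sigma = 12
print(gaussian(35, 38, 12))   # approximately 0.0322
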
Happy Learning!
