Department of Computer Engineering: Experiment No.6
Theory / Algorithm:
Bayes' Theorem states: P(c|x) = P(x|c) * P(c) / P(x)
• It assumes that the effect of an attribute value on class-membership probability is independent of the values of the other attributes. This is called conditional independence.
• Let x1, x2, x3, …, xn be the data set, with m attributes a1, a2, a3, …, am.
• Suppose there are classes c1, c2, …, cn. An unknown sample x is placed in the class whose conditional probability is the highest, i.e.,
P(c|x) = P(x|c) * P(c) / P(x)
• Since P(x) is constant for all classes, it is enough to compare the following (a short code sketch of this rule is given after this list):
o P(c|x) ∝ P(x|c) * P(c)
o where x = {x1, x2, x3, …, xn}
o P(c) = (no. of samples belonging to class c, si) / (total no. of samples, s) = si / s
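As a quick illustration of this decision rule, here is a minimal Python sketch. The priors and conditional probabilities below are made-up placeholder numbers for two hypothetical classes; they are not taken from the example data set that follows.

# Sketch of the naive Bayes decision rule: pick the class c that maximises
# P(c) * P(x1|c) * P(x2|c) * ... (P(x) is constant and can be ignored).
priors = {'c1': 0.6, 'c2': 0.4}              # P(c) = si / s  (placeholder values)
likelihoods = {                              # P(xj | c) for each attribute value of x
    'c1': [0.5, 0.2, 0.7],
    'c2': [0.3, 0.6, 0.4],
}

def classify(priors, likelihoods):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for p_xj in likelihoods[c]:
            score *= p_xj                    # conditional independence assumption
        scores[c] = score                    # proportional to P(c | x)
    return max(scores, key=scores.get)

print(classify(priors, likelihoods))         # -> 'c1' (0.042 vs 0.0288)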
Example of Bayesian Classifier
Example No. Color Type Origin Stolen?
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
Training Example
We want to classify a Red Domestic SUV. Note that there is no example of a Red Domestic SUV in our data set. Looking back at the decision rule above, we can see how to compute this: we need to calculate the probabilities
P(Red|Yes), P(SUV|Yes), P(Domestic|Yes),
P(Red|No), P(SUV|No), and P(Domestic|No),
and multiply them by P(Yes) and P(No) respectively. We estimate each of these conditional probabilities with the m-estimate
P(a|v) = (nc + m * p) / (n + m),
where n is the number of training examples with class v, nc is the number of those examples that also have attribute value a, p is a prior estimate of the probability, and m is the equivalent sample size.
Yes:                                No:
  Red:      n = 5, nc = 3            Red:      n = 5, nc = 2
  SUV:      n = 5, nc = 1            SUV:      n = 5, nc = 3
  Domestic: n = 5, nc = 2            Domestic: n = 5, nc = 3
(p = .5 and m = 3 for every attribute value)
Looking at P(Red|Yes), we have 5 cases where v = Yes, and in 3 of those cases the attribute value is Red. So for P(Red|Yes), n = 5 and nc = 3. Note that all attributes are binary (two possible values). We are assuming no other information, so p = 1 / (number of attribute values) = 0.5 for all of our attributes. Our m value is arbitrary (we will use m = 3) but consistent for all attributes. Now we simply apply the m-estimate formula using the precomputed values of n, nc, p, and m.
P(Red|Yes)      = (3 + 3*.5) / (5 + 3) = .56      P(Red|No)      = (2 + 3*.5) / (5 + 3) = .43
P(SUV|Yes)      = (1 + 3*.5) / (5 + 3) = .31      P(SUV|No)      = (3 + 3*.5) / (5 + 3) = .56
P(Domestic|Yes) = (2 + 3*.5) / (5 + 3) = .43      P(Domestic|No) = (3 + 3*.5) / (5 + 3) = .56
We have P(Yes) = .5 and P(No) = .5, so we can apply the decision rule. For v = Yes, we have
P(Yes) * P(Red|Yes) * P(SUV|Yes) * P(Domestic|Yes) = .5 * .56 * .31 * .43 = .037
and for v = No, we have
P(No) * P(Red|No) * P(SUV|No) * P(Domestic|No) = .5 * .43 * .56 * .56 = .069
Since .069 > .037, the example is classified as No: the Red Domestic SUV is predicted not to be stolen.
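The same arithmetic can be reproduced with a short script. This is only a sketch written for this write-up (it is not part of the lab code below); the counts are copied from the table and the m-estimate values above.

# Sketch: reproduce the hand calculation with the m-estimate
# P(a|v) = (nc + m*p) / (n + m), using the counts from the car data set.
m, p, n = 3, 0.5, 5                       # m, prior estimate p, and n = 5 examples per class
counts = {
    'Yes': {'Red': 3, 'SUV': 1, 'Domestic': 2},   # nc for each attribute value given Yes
    'No':  {'Red': 2, 'SUV': 3, 'Domestic': 3},   # nc for each attribute value given No
}
priors = {'Yes': 0.5, 'No': 0.5}

def m_estimate(nc):
    return (nc + m * p) / (n + m)

scores = {}
for label, ncs in counts.items():
    score = priors[label]
    for nc in ncs.values():
        score *= m_estimate(nc)           # multiply the conditional probabilities
    scores[label] = score

print(scores)                             # about 0.038 for Yes (the hand calculation rounds to .037) and 0.069 for No
print(max(scores, key=scores.get))        # -> 'No', matching the hand calculation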
Code
#Make Predictions with Naive Bayes on the Iris Dataset
from csv import reader
from math import sqrt
from math import exp
from math import pi
#Load CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

#Convert String Columns to Float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

#Convert String Columns to Integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
        print(value + ' => ' + str(i))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

#Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if class_value not in separated:
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

#Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

#Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x - avg) ** 2 for x in numbers]) / float(len(numbers) - 1)
    return sqrt(variance)

#Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1])  # drop the statistics for the class column
    return summaries

#Split dataset by class then calculate statistics for each class
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

#Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

#Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2] / float(total_rows)  # prior P(class)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

#Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

#Make a prediction with Naive Bayes on Iris Dataset
filename = r'D:\Python\iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
#Convert class column to integers
str_column_to_int(dataset, len(dataset[0]) - 1)
#Fit model
model = summarize_by_class(dataset)
#Define a new record
row = [6.9, 3.2, 5.7, 2.3]
#Predict the label
label = predict(model, row)
print('Data=' + str(row) + ' Predicted=' + str(label))
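As an optional cross-check (not part of the original program), the same record can be classified with scikit-learn's Gaussian Naive Bayes. This sketch assumes scikit-learn is installed and uses its bundled copy of the Iris data instead of the local iris.csv; its prediction should agree with the from-scratch model.

# Optional cross-check with scikit-learn (assumed installed); uses the bundled
# Iris data rather than the local iris.csv file.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
clf = GaussianNB().fit(iris.data, iris.target)
pred = clf.predict([[6.9, 3.2, 5.7, 2.3]])[0]
print('Predicted:', pred, '=>', iris.target_names[pred])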
Output
Conclusion
Running the program first prints the mapping of class labels to integers and then fits the model on the entire dataset. There are three class labels: 0, 1 and 2. When a new observation is given, its class label is predicted. Here, our observation is predicted as belonging to class 2, which is "Iris-virginica".