
Department of Computer Engineering

Experiment No.6

Semester T.E. Semester V – Computer Engineering


Subject Data warehousing & Mining Lab
Subject Professor In-charge Prof. Prita Patil
Assisting Teachers Prof. Prita Patil
Laboratory MS-Team

Student Name Pratik Haldankar


Roll Number 20102A2006
Grade and Subject Teacher’s Signature

Experiment Number 6

Experiment Title Implementation of Bayesian Classification


Resources / Apparatus Required Hardware: Computer system
Software: Text Editor (MS-Word)

Theory Algorithm:
Bayes’ Theorem states: P(c|x) = P(x|c) * P(c) / P(x)
• It assumes that the effect of an attribute value on class membership probability is independent of the values of the other attributes. This is called conditional independence.
• Let x1, x2, x3, …, xn be the data set with m attributes a1, a2, a3, …, am.
• Suppose there are classes c1, c2, c3, …, cn. An unknown sample x is placed in the class whose conditional probability is the highest, i.e.,
P(c|x) = P(x|c) * P(c) / P(x)
• Since P(x) is constant for all classes:
  o P(c|x) ∝ P(x|c) * P(c)
  o where x = {x1, x2, x3, …, xn}
  o P(c) = (no. of samples belonging to class c, si) / (total no. of samples, s) = si / s
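To make the decision rule concrete, the following is a minimal sketch (not part of the experiment code) that picks the class maximising P(c) * product of P(xi|c) from raw counts; the weather attributes, labels, and the helper name naive_bayes_predict are illustrative assumptions only.

# A minimal sketch of the decision rule above; the tiny data set is made up.
from collections import Counter

def naive_bayes_predict(train_rows, train_labels, sample):
    # pick the class c that maximises P(c) * product over i of P(x_i | c)
    total = len(train_labels)
    class_counts = Counter(train_labels)                 # s_i for each class
    best_class, best_score = None, -1.0
    for c, s_i in class_counts.items():
        score = s_i / total                              # prior P(c) = s_i / s
        for i, value in enumerate(sample):
            # count training rows of class c whose i-th attribute equals this value
            match = sum(1 for r, lab in zip(train_rows, train_labels)
                        if lab == c and r[i] == value)
            score *= match / s_i                         # likelihood P(x_i | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# illustrative usage
rows = [('Sunny', 'Hot'), ('Rainy', 'Cool'), ('Sunny', 'Cool'), ('Rainy', 'Hot')]
labels = ['Play', 'Stay', 'Play', 'Stay']
print(naive_bayes_predict(rows, labels, ('Sunny', 'Cool')))  # prints 'Play'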
Example of Bayesian Classifier
Example No. Color Type Origin Stolen?
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
Training example
We want to classify a Red Domestic SUV. Note that there is no example of a Red Domestic SUV in our data set. Using the naive Bayes rule above, we need to calculate the probabilities
P(Red|Yes), P(SUV|Yes), P(Domestic|Yes),
P(Red|No), P(SUV|No), and P(Domestic|No)
and multiply them by P(Yes) and P(No) respectively. We estimate each conditional probability with the m-estimate
P(ai|c) = (nc + m*p) / (n + m)
where n is the number of training examples of class c, nc is the number of those examples having attribute value ai, p is the prior estimate of the probability, and m is the equivalent sample size.
Yes:
  Red:      n = 5, nc = 3, p = 0.5, m = 3
  SUV:      n = 5, nc = 1, p = 0.5, m = 3
  Domestic: n = 5, nc = 2, p = 0.5, m = 3
No:
  Red:      n = 5, nc = 2, p = 0.5, m = 3
  SUV:      n = 5, nc = 3, p = 0.5, m = 3
  Domestic: n = 5, nc = 3, p = 0.5, m = 3
Looking at P(Red|Yes), we have 5 cases where vj = Yes, and in 3 of those cases ai = Red. So for P(Red|Yes), n = 5 and nc = 3. Note that all attributes are binary (two possible values). We assume no other information, so p = 1 / (number of attribute values) = 0.5 for all of our attributes. Our m value is arbitrary (we will use m = 3) but consistent for all attributes. Now we simply apply the m-estimate formula using the precomputed values of n, nc, p, and m.
P(Red|Yes)      = (3 + 3*0.5) / (5 + 3) = 4.5/8 ≈ 0.56
P(Red|No)       = (2 + 3*0.5) / (5 + 3) = 3.5/8 ≈ 0.44
P(SUV|Yes)      = (1 + 3*0.5) / (5 + 3) = 2.5/8 ≈ 0.31
P(SUV|No)       = (3 + 3*0.5) / (5 + 3) = 4.5/8 ≈ 0.56
P(Domestic|Yes) = (2 + 3*0.5) / (5 + 3) = 3.5/8 ≈ 0.44
P(Domestic|No)  = (3 + 3*0.5) / (5 + 3) = 4.5/8 ≈ 0.56
We have P(Yes) = 0.5 and P(No) = 0.5, so we can apply the naive Bayes rule.
For v = Yes, we have
P(Yes) * P(Red|Yes) * P(SUV|Yes) * P(Domestic|Yes) = 0.5 * 0.56 * 0.31 * 0.44 ≈ 0.038
and for v = No, we have
P(No) * P(Red|No) * P(SUV|No) * P(Domestic|No) = 0.5 * 0.44 * 0.56 * 0.56 ≈ 0.069

Since 0.069 > 0.038, our example gets classified as 'No'.
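The same arithmetic can be reproduced with a short script. This is only a sketch of the m-estimate calculation; the helper name m_estimate is an assumption, and the hard-coded counts come directly from the table above.

# Reproduce the Red Domestic SUV calculation with the m-estimate
def m_estimate(n_c, n, p=0.5, m=3):
    # m-estimate of a conditional probability: (n_c + m*p) / (n + m)
    return (n_c + m * p) / (n + m)

# attribute match counts within each class (5 'Yes' rows and 5 'No' rows)
yes = 0.5 * m_estimate(3, 5) * m_estimate(1, 5) * m_estimate(2, 5)  # Red, SUV, Domestic | Yes
no  = 0.5 * m_estimate(2, 5) * m_estimate(3, 5) * m_estimate(3, 5)  # Red, SUV, Domestic | No

print(round(yes, 3), round(no, 3))                 # ~0.038 vs ~0.069
print('Stolen?', 'Yes' if yes > no else 'No')      # prints 'No'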

Code
#Make Predictions with Naive Bayes on the Iris Dataset
from csv import reader
from math import sqrt
from math import exp
from math import pi

#Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

#Convert string columns to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

#Convert string class column to integer codes
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
        print(value + ' => ' + str(i))
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

#Split the dataset by class values, returns a dictionary
def separate_by_class(dataset):
    separated = dict()
    for i in range(len(dataset)):
        vector = dataset[i]
        class_value = vector[-1]
        if class_value not in separated:
            separated[class_value] = list()
        separated[class_value].append(vector)
    return separated

#Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

#Calculate the standard deviation of a list of numbers
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x - avg) ** 2 for x in numbers]) / float(len(numbers) - 1)
    return sqrt(variance)

#Calculate the mean, stdev and count for each column in a dataset
def summarize_dataset(dataset):
    summaries = [(mean(column), stdev(column), len(column)) for column in zip(*dataset)]
    del(summaries[-1])
    return summaries

#Split dataset by class then calculate statistics for each class
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = dict()
    for class_value, rows in separated.items():
        summaries[class_value] = summarize_dataset(rows)
    return summaries

#Calculate the Gaussian probability distribution function for x
def calculate_probability(x, mean, stdev):
    exponent = exp(-((x - mean) ** 2 / (2 * stdev ** 2)))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

#Calculate the probabilities of predicting each class for a given row
def calculate_class_probabilities(summaries, row):
    total_rows = sum([summaries[label][0][2] for label in summaries])
    probabilities = dict()
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = summaries[class_value][0][2] / float(total_rows)
        for i in range(len(class_summaries)):
            mean, stdev, _ = class_summaries[i]
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

#Predict the class for a given row
def predict(summaries, row):
    probabilities = calculate_class_probabilities(summaries, row)
    best_label, best_prob = None, -1
    for class_value, probability in probabilities.items():
        if best_label is None or probability > best_prob:
            best_prob = probability
            best_label = class_value
    return best_label

#Make a prediction with Naive Bayes on the Iris dataset
filename = r'D:\Python\iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
#Convert class column to integers
str_column_to_int(dataset, len(dataset[0]) - 1)
#Fit model
model = summarize_by_class(dataset)
#Define a new record
row = [6.9, 3.2, 5.7, 2.3]
#Predict the label
label = predict(model, row)
print('Data=' + str(row) + ' Predicted=' + str(label))

Output

Conclusion Running the program first prints the mapping of class labels to integers and then fits the model on the entire dataset. There are three class labels: 0, 1, and 2. When the new observation is defined, a class label is predicted for it. Here, our observation [6.9, 3.2, 5.7, 2.3] is predicted as belonging to class 2, which is "Iris-virginica".
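As an optional sanity check (an assumption, not part of the submitted experiment code), the same record can be classified with scikit-learn's GaussianNB, which should agree with the conclusion above if the library is installed:

# Optional cross-check with scikit-learn's Gaussian Naive Bayes;
# it uses the bundled iris data rather than the local iris.csv path.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
clf = GaussianNB().fit(iris.data, iris.target)   # fit on all 150 labelled rows
pred = clf.predict([[6.9, 3.2, 5.7, 2.3]])[0]    # same new record as in the code above
print(iris.target_names[pred])                   # expected to print 'virginica'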
