MLT Lab 06
Practical-06
AIM - Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to
perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy,
precision, and recall for your data set.
Theory:- The Naïve Bayesian Classifier is a probabilistic machine learning model used for text classification
tasks, such as spam detection or sentiment analysis. It is based on Bayes' Theorem, with the "naïve" assumption
that all features (words in a document) are independent of each other given the class label. Despite this
simplification, it performs remarkably well in practical applications.
Key Concepts:
Bayes’ Theorem:
It provides a way to calculate the probability of a hypothesis given the evidence.
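For document classification, the hypothesis is a class label and the evidence is a document. Using the symbols c for the class and d for the document (notation introduced here for illustration, not taken from the original text), the theorem can be written as:

```latex
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}
```

Here P(c) is the prior, P(d | c) is the likelihood, and P(c | d) is the posterior. P(d) is the same for every class, so it can be ignored when classes are only being compared.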
How the Naïve Bayesian Classifier Works for Document Classification:
1. Preprocess the Text:
Convert documents into tokens (words), remove stopwords, and vectorize the data using techniques
like Bag of Words or TF-IDF.
2. Training Phase:
Use the training documents and their labels to calculate the prior and likelihood probabilities for
each class.
3. Prediction Phase:
For a new/unseen document, compute the posterior probability for each class, and assign the class
with the highest probability.
4. Evaluation:
Use metrics such as Accuracy, Precision, and Recall to evaluate model performance.
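Steps 1 to 3 above can be sketched without any library, purely as an illustration. The function names (`tokenize`, `train_nb`, `predict_nb`), the toy documents, the whitespace tokenization, and the add-one (Laplace) smoothing are all assumptions made for this sketch; none of them come from the lab's own program, which relies on scikit-learn.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Step 1 (simplified): lowercase and split on whitespace; no stopword removal.
    return text.lower().split()

def train_nb(docs, labels):
    # Step 2: estimate log priors and per-class word counts from the training data.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    return log_prior, word_counts, vocab

def predict_nb(model, doc):
    # Step 3: score each class as log P(c) + sum of log P(w | c), then take the max.
    log_prior, word_counts, vocab = model
    scores = {}
    for c in log_prior:
        total = sum(word_counts[c].values())
        score = log_prior[c]
        for w in tokenize(doc):
            if w in vocab:  # words never seen in training are skipped
                # Add-one smoothing (an assumption of this sketch) avoids log(0).
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only.
train_docs = ["I love this movie", "great acting and great story",
              "I hate this movie", "boring and bad story"]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "great movie")
print(prediction)
```

Working in log space keeps the product of many small probabilities from underflowing to zero, which is why the scores are sums of logarithms rather than products.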
• The features (words) are conditionally independent given the class.
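Written out for a document d consisting of words w_1, ..., w_n and a class c (notation assumed here for illustration), this assumption lets the likelihood factor into per-word terms, and the predicted class is the one that maximizes the resulting product:

```latex
P(d \mid c) = \prod_{i=1}^{n} P(w_i \mid c)
\quad\Longrightarrow\quad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```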
Source Code :-
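import pandas as pd

# Load the labelled documents; the column names are supplied here.
msg = pd.read_csv('/content/sample_data/document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])

# Encode the class labels as integers: pos -> 1, neg -> 0.
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum

# Split into training and test sets (scikit-learn's default is 75% / 25%).
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)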

# Build a bag-of-words document-term matrix; the vocabulary is fitted on the training data only.
from sklearn.feature_extraction.text import CountVectorizer
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])

# Train a multinomial Naive Bayes model and predict labels for the test documents.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
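
# Show each test document with its predicted label.
# The predictions were made on Xtest, so they are paired with Xtest, not Xtrain.
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))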

# Evaluate the predictions against the true test labels.
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))
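
As a cross-check on the scikit-learn scores, the three metrics can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the helper name `binary_metrics` and the sample label lists are hypothetical, and it assumes the positive class is encoded as 1, matching the `labelnum` mapping in the program.

```python
def binary_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells, treating 1 as the positive class.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Accuracy: fraction of all documents classified correctly.
    accuracy = (tp + tn) / len(pairs)
    # Precision: of the documents predicted positive, how many really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the truly positive documents, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

With these sample lists there are 2 true positives, 2 true negatives, 1 false positive, and 1 false negative, so accuracy is 4/6 and both precision and recall are 2/3.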