ML Lab 4
1 ML Lab-4
1.1 ROHITH KUMAR B-2382487
1.1.1 Naive Bayes (NB) algorithm
To identify the suitable type of Naive Bayes (NB) algorithm for each dataset, we need to understand
the nature of the data and how the different types of NB algorithms work. Here’s a brief overview of
the Naive Bayes variants considered:
Gaussian Naive Bayes: Suitable for continuous data that follows a Gaussian (normal) distribution.
Multinomial Naive Bayes: Suitable for discrete data and often used for text classification,
where the features represent term frequencies.
Bernoulli Naive Bayes: Suitable for binary features, such as the presence or absence of a term.
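The difference between the Gaussian and Multinomial variants comes down to how each one scores a feature for a given class. The following is a minimal plain-Python sketch of that idea, not the code used in this lab: the function names, the small variance floor, the smoothing value `alpha=1.0`, and all of the toy numbers are assumptions chosen for illustration.

```python
import math

def gaussian_log_likelihood(x, values):
    """Log density of x under a normal distribution fitted to one class's values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    var += 1e-9  # assumed floor so a constant feature does not divide by zero
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def multinomial_log_likelihood(counts, class_totals, alpha=1.0):
    """Log-likelihood of a count vector given one class's per-feature totals."""
    denom = sum(class_totals) + alpha * len(class_totals)  # Laplace smoothing
    return sum(c * math.log((n + alpha) / denom)
               for c, n in zip(counts, class_totals))

# Continuous feature (toy measurements for two hypothetical classes)
class_a = [1.4, 1.4, 1.3, 1.5, 1.4]
class_b = [4.5, 4.7, 4.9]
gauss_a = gaussian_log_likelihood(1.6, class_a)
gauss_b = gaussian_log_likelihood(1.6, class_b)
print("Gaussian favours class:", "a" if gauss_a > gauss_b else "b")

# Count features (toy totals for the words "free", "prize", "lunch")
spam_totals = [3, 2, 0]
ham_totals = [0, 0, 2]
message = [1, 1, 0]  # the message contains "free" and "prize" once each
multi_spam = multinomial_log_likelihood(message, spam_totals)
multi_ham = multinomial_log_likelihood(message, ham_totals)
print("Multinomial favours class:", "spam" if multi_spam > multi_ham else "ham")
```

A full classifier would also add the log prior of each class and sum the per-feature scores; those steps are left out here to keep the contrast between the two likelihoods visible.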
# Column names for the 23 columns of the Mushroom dataset
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat',
]
[1]: class cap-shape cap-surface cap-color bruises odor gill-attachment \
0 p x s n t p f
1 e x s y t a f
2 e b s w t l f
3 p x y w t p f
4 e x s g f n f
[5 rows x 23 columns]
The Mushroom dataset contains categorical data. For such data, Multinomial Naive Bayes or
Bernoulli Naive Bayes would be appropriate once the categories are encoded numerically. However,
since many of the features take more than two values, Multinomial Naive Bayes is more suitable.
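One way to feed categorical features to Multinomial Naive Bayes is to one-hot encode them, so that every category becomes a 0/1 count. The sketch below assumes scikit-learn is available and uses only the five rows and first six feature columns printed above; the encoding choice, the variable names, and the `new_sample` row are assumptions for illustration and may differ from what this lab actually ran.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# First five rows as printed in the output above:
# cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment
X_raw = [
    ["x", "s", "n", "t", "p", "f"],
    ["x", "s", "y", "t", "a", "f"],
    ["b", "s", "w", "t", "l", "f"],
    ["x", "y", "w", "t", "p", "f"],
    ["x", "s", "g", "f", "n", "f"],
]
y = ["p", "e", "e", "p", "e"]  # class column: p = poisonous, e = edible

# Each distinct category becomes its own 0/1 column;
# categories unseen during fitting are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_raw)

model = MultinomialNB()
model.fit(X, y)

# Hypothetical new mushroom, described with the same six features
new_sample = [["x", "s", "w", "t", "p", "f"]]
prediction = model.predict(encoder.transform(new_sample))
print("Encoded shape:", X.shape)
print("Predicted class:", prediction[0])
```

Five rows are far too few to train a useful model; the point is only to show the encode-then-fit sequence on data in the same format.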
[2]: sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
target
0 0
1 0
2 0
3 0
4 0
The Iris dataset contains continuous data. Therefore, Gaussian Naive Bayes is suitable for this
dataset because it models each feature as normally distributed within each class.
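A minimal sketch of fitting Gaussian Naive Bayes to the Iris data with scikit-learn might look like the following. The 80/20 split, the stratification, and `random_state=42` are assumptions made for reproducibility here, not settings confirmed by this lab, so the resulting accuracy can differ from the value reported later.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the four continuous measurements and the integer class labels
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing (assumed split, see note above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# GaussianNB estimates a mean and a variance per feature and per class
model = GaussianNB()
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.3f}")
```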
The SMS Spam Collection dataset poses a text classification problem. For such data, Multinomial Naive
Bayes is suitable because it works well with discrete features like word counts or term frequencies.
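The word-count approach can be sketched with a `CountVectorizer` feeding a `MultinomialNB` classifier. The four training messages and the two test messages below are made-up toy examples, not rows from the SMS Spam Collection, and the use of `make_pipeline` is an assumption about structure rather than a description of this lab's code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages (not taken from the real dataset)
messages = [
    "win a free prize now",
    "free entry claim your prize",
    "are we meeting for lunch today",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each message into a vector of word counts;
# MultinomialNB then models those counts per class with Laplace smoothing.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

predictions = model.predict(["claim your free prize", "lunch at home today"])
print(list(predictions))
```

Wrapping both steps in one pipeline means the vocabulary is learned only from the training messages, so the same object can be applied to unseen text without refitting the vectorizer.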
Analysis and Justification
Let’s implement and analyze each dataset with the appropriate Naive Bayes algorithm.
# Column names for the 23 columns of the Mushroom dataset
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat',
]
stalk-color-above-ring stalk-color-below-ring veil-type veil-color \
0 w w p w
1 w w p w
2 w w p w
3 w w p w
4 w w p w
[5 rows x 23 columns]
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
'stalk-surface-below-ring', 'stalk-color-above-ring',
'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
'ring-type', 'spore-print-color', 'population', 'habitat'],
dtype='object')
Accuracy: 0.8073846153846154
[8]: 1.0
[10]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
Accuracy: 0.9856502242152466
1.1.2 Conclusion
Mushroom Dataset: Multinomial Naive Bayes is suitable because the data is categorical. The
model reached an accuracy of about 0.81 on this dataset.
Iris Dataset: Gaussian Naive Bayes is appropriate because the data is continuous and is assumed
to be approximately normally distributed within each class. The model should perform well when
this assumption holds.
SMS Spam Collection Dataset: Multinomial Naive Bayes is suitable because the data is text-based
and the algorithm works well with term frequencies. The model reached an accuracy of about 0.99
on this text classification task.