ML Lab 4


ml-lab-4

July 19, 2024

1 ML Lab-4
1.1 ROHITH KUMAR B-2382487
1.1.1 Naive Bayes (NB) algorithm
To identify the suitable type of Naive Bayes (NB) algorithm for each dataset, we need to understand
the nature of the data and how the different NB variants work. Here's a brief overview of the main
Naive Bayes algorithms (a short illustrative sketch follows the list):

Gaussian Naive Bayes: Suitable for continuous data that follows a Gaussian (normal) distri-
bution.

Multinomial Naive Bayes: Suitable for discrete data and is often used for text classification
where the features represent term frequencies.

Bernoulli Naive Bayes: Suitable for binary/boolean features.
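As a minimal sketch (the tiny arrays below are invented purely to illustrate the expected input
types, not taken from any dataset in this lab), each variant maps onto a scikit-learn class:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.2], [2.8, 2.5]])
GaussianNB().fit(X_cont, y)

# Non-negative counts (e.g. word frequencies) -> MultinomialNB
X_counts = np.array([[2, 0], [1, 1], [0, 3], [0, 4]])
MultinomialNB().fit(X_counts, y)

# Binary presence/absence indicators -> BernoulliNB
X_bin = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
BernoulliNB().fit(X_bin, y)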


Let’s perform the analysis step by step for each dataset.

Dataset 1: Mushroom Dataset


[1]: import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]
data = pd.read_csv(url, header=None, names=column_names)
data.head()

[1]:   class cap-shape cap-surface cap-color bruises odor gill-attachment  \
     0     p         x           s         n       t    p               f
     1     e         x           s         y       t    a               f
     2     e         b           s         w       t    l               f
     3     p         x           y         w       t    p               f
     4     e         x           s         g       f    n               f

       gill-spacing gill-size gill-color  … stalk-surface-below-ring  \
     0            c         n          k  …                        s
     1            c         b          k  …                        s
     2            c         b          n  …                        s
     3            c         n          n  …                        s
     4            w         b          k  …                        s

       stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
     0                      w                      w         p          w
     1                      w                      w         p          w
     2                      w                      w         p          w
     3                      w                      w         p          w
     4                      w                      w         p          w

       ring-number ring-type spore-print-color population habitat
     0           o         p                 k          s       u
     1           o         p                 n          n       g
     2           o         p                 n          n       m
     3           o         p                 k          s       u
     4           o         e                 n          a       g

     [5 rows x 23 columns]

The Mushroom dataset contains categorical data, so a discrete Naive Bayes variant is appropriate.
Since most features take more than two values, Multinomial Naive Bayes (applied to integer-encoded
features) is used here rather than Bernoulli Naive Bayes. Note that this treats arbitrary category
codes as counts, which is only an approximation; scikit-learn's CategoricalNB models such features
directly (see the sketch after the mushroom results below).

Dataset 2: Iris Dataset


[2]: import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.head()

[2]:    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
     0                5.1               3.5                1.4               0.2
     1                4.9               3.0                1.4               0.2
     2                4.7               3.2                1.3               0.2
     3                4.6               3.1                1.5               0.2
     4                5.0               3.6                1.4               0.2

        target
     0       0
     1       0
     2       0
     3       0
     4       0

The Iris dataset contains continuous data. Therefore, Gaussian Naive Bayes is suitable for this
dataset because it assumes that the features follow a normal distribution.
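To make the Gaussian assumption concrete, here is a minimal sketch (not part of the lab's pipeline)
that reproduces by hand the per-feature likelihood GaussianNB estimates: each feature is modeled as
a normal distribution whose mean and variance are fitted per class.

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Fit class-conditional Gaussian parameters for class 0 (setosa)
X0 = X[y == 0]
mu, var = X0.mean(axis=0), X0.var(axis=0)

def gaussian_pdf(x, mu, var):
    # Per-feature normal density: the likelihood GaussianNB assumes
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Log-likelihood of one sample under the class-0 model
# (per-feature densities are summed in log space, i.e. the features
# are assumed conditionally independent given the class)
sample = X[0]
print(np.log(gaussian_pdf(sample, mu, var)).sum())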

Dataset 3: SMS Spam Collection Dataset


[3]: import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])
data.head()

[3]:   label                                            message
     0   ham  Go until jurong point, crazy.. Available only …
     1   ham                          Ok lar… Joking wif u oni…
     2  spam  Free entry in 2 a wkly comp to win FA Cup fina…
     3   ham    U dun say so early hor… U c already then say…
     4   ham  Nah I don't think he goes to usf, he lives aro…

The SMS Spam Collection dataset is a text classification problem. For such data, Multinomial Naive
Bayes is suitable because it works well with discrete features like word counts or term frequencies.
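As a quick illustration of what "term frequencies" means here (a hedged sketch; the toy messages
are invented), CountVectorizer turns raw text into the word-count matrix that MultinomialNB
consumes:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only to show the transformation
docs = ["free prize now", "call me now", "free free entry"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary (scikit-learn >= 1.0)
print(X.toarray())  # one row per message, one count per vocabulary word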

Analysis and Justification

Let's implement and analyze each dataset with the appropriate Naive Bayes algorithm.

Mushroom Dataset with Multinomial Naive Bayes


[5]: import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]
data = pd.read_csv(url, header=None, names=column_names)
print(data.head())  # Check if the data is loaded correctly

# Encode categorical features (each column gets its own integer codes)
le = LabelEncoder()
encoded_data = data.apply(le.fit_transform)

# Verify the columns after encoding
print(encoded_data.columns)

# Split the data
X = encoded_data.drop('class', axis=1)
y = encoded_data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

  class cap-shape cap-surface cap-color bruises odor gill-attachment  \
0     p         x           s         n       t    p               f
1     e         x           s         y       t    a               f
2     e         b           s         w       t    l               f
3     p         x           y         w       t    p               f
4     e         x           s         g       f    n               f

  gill-spacing gill-size gill-color  … stalk-surface-below-ring  \
0            c         n          k  …                        s
1            c         b          k  …                        s
2            c         b          n  …                        s
3            c         n          n  …                        s
4            w         b          k  …                        s

  stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
0                      w                      w         p          w
1                      w                      w         p          w
2                      w                      w         p          w
3                      w                      w         p          w
4                      w                      w         p          w

  ring-number ring-type spore-print-color population habitat
0           o         p                 k          s       u
1           o         p                 n          n       g
2           o         p                 n          n       m
3           o         p                 k          s       u
4           o         e                 n          a       g

[5 rows x 23 columns]
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],
      dtype='object')
Accuracy: 0.8073846153846154
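Because label-encoded categories are not true counts, an alternative worth trying is CategoricalNB,
which models each integer-coded feature as an unordered category. This is a sketch, not part of the
original lab; it assumes scikit-learn >= 0.22 and reuses X_train, X_test, y_train, y_test from above:

from sklearn.naive_bayes import CategoricalNB

# CategoricalNB treats each integer code as an unordered category, which
# matches how LabelEncoder produced these features. It assumes every code
# seen at predict time also appeared in the training split.
cat_model = CategoricalNB()
cat_model.fit(X_train, y_train)
print(f"CategoricalNB accuracy: {accuracy_score(y_test, cat_model.predict(X_test))}")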

Iris Dataset with Gaussian Naive Bayes


[8]: from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

[8]: 1.0
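A perfect score on a 30-sample test split can be optimistic. As a sanity check (not in the original
lab), cross-validation over the full Iris data gives a more robust estimate:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validation; the exact scores depend on the fold assignment
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")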

SMS Spam Collection Dataset with Multinomial Naive Bayes

[10]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Convert text data to term frequency vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['message'])
y = data['label'].map({'ham': 0, 'spam': 1})

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9856502242152466
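To classify unseen text, new messages must go through the already-fitted vectorizer (transform, not
fit_transform). A minimal sketch with invented example messages:

# Hypothetical new messages, reusing the fitted vectorizer and model
new_messages = ["WINNER! Claim your free prize now", "Are we still on for lunch?"]
X_new = vectorizer.transform(new_messages)
print(model.predict(X_new))  # 1 = spam, 0 = ham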

1.1.2 Conclusion
Mushroom Dataset: Multinomial Naive Bayes was applied because the data is categorical with more
than two values per feature. The observed accuracy (about 81%) is moderate rather than high,
because label encoding treats arbitrary category codes as counts; CategoricalNB (sketched above)
models such features more faithfully.

Iris Dataset: Gaussian Naive Bayes is appropriate because the features are continuous and
approximately normally distributed within each class. Under this assumption the model classified
the held-out test set perfectly (accuracy 1.0).

SMS Spam Collection Dataset: Multinomial Naive Bayes is suitable because the data is text-based
and the model pairs naturally with term-frequency features. It achieved about 98.6% accuracy on
the test set.
