ML Lab 4


ml-lab-4

July 19, 2024

1 ML Lab-4
1.1 ROHITH KUMAR B-2382487
1.1.1 Naive Bayes (NB) algorithm
To identify the suitable type of Naive Bayes (NB) algorithm for each dataset, we need to understand
the nature of the data and how the different NB variants work. Here's a brief overview of the main
Naive Bayes algorithms (a short illustrative sketch follows the list):

Gaussian Naive Bayes: Suitable for continuous data that follows a Gaussian (normal) distri-
bution.

Multinomial Naive Bayes: Suitable for discrete data and is often used for text classification
where the features represent term frequencies.

Bernoulli Naive Bayes: Suitable for binary/boolean features.
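As a minimal sketch (the tiny arrays below are invented purely to illustrate the expected input
types, not taken from any dataset in this lab), each variant maps onto a scikit-learn class:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.2], [2.8, 2.5]])
GaussianNB().fit(X_cont, y)

# Non-negative counts (e.g. word frequencies) -> MultinomialNB
X_counts = np.array([[2, 0], [1, 1], [0, 3], [0, 4]])
MultinomialNB().fit(X_counts, y)

# Binary presence/absence indicators -> BernoulliNB
X_bin = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
BernoulliNB().fit(X_bin, y)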


Let’s perform the analysis step by step for each dataset.

Dataset 1: Mushroom Dataset


[1]: import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]
data = pd.read_csv(url, header=None, names=column_names)
data.head()

[1]:   class cap-shape cap-surface cap-color bruises odor gill-attachment  \
     0     p         x           s         n       t    p               f
     1     e         x           s         y       t    a               f
     2     e         b           s         w       t    l               f
     3     p         x           y         w       t    p               f
     4     e         x           s         g       f    n               f

       gill-spacing gill-size gill-color  … stalk-surface-below-ring  \
     0            c         n          k  …                        s
     1            c         b          k  …                        s
     2            c         b          n  …                        s
     3            c         n          n  …                        s
     4            w         b          k  …                        s

       stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
     0                      w                      w         p          w
     1                      w                      w         p          w
     2                      w                      w         p          w
     3                      w                      w         p          w
     4                      w                      w         p          w

       ring-number ring-type spore-print-color population habitat
     0           o         p                 k          s       u
     1           o         p                 n          n       g
     2           o         p                 n          n       m
     3           o         p                 k          s       u
     4           o         e                 n          a       g

     [5 rows x 23 columns]

The Mushroom dataset contains categorical data, so a discrete Naive Bayes variant is appropriate.
Since most features take more than two values, Multinomial Naive Bayes (applied to integer-encoded
features) is used here rather than Bernoulli Naive Bayes. Note that this treats arbitrary category
codes as counts, which is only an approximation; scikit-learn's CategoricalNB models such features
directly (see the sketch after the mushroom results below).

Dataset 2: Iris Dataset


[2]: import pandas as pd
from sklearn.datasets import load_iris

# Load the dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target
df.head()

[2]:    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
     0                5.1               3.5                1.4               0.2
     1                4.9               3.0                1.4               0.2
     2                4.7               3.2                1.3               0.2
     3                4.6               3.1                1.5               0.2
     4                5.0               3.6                1.4               0.2

        target
     0       0
     1       0
     2       0
     3       0
     4       0

The Iris dataset contains continuous data. Therefore, Gaussian Naive Bayes is suitable for this
dataset because it assumes that the features follow a normal distribution.
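To make the Gaussian assumption concrete, here is a minimal sketch (not part of the lab's pipeline)
that reproduces by hand the per-feature likelihood GaussianNB estimates: each feature is modeled as
a normal distribution whose mean and variance are fitted per class.

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Fit class-conditional Gaussian parameters for class 0 (setosa)
X0 = X[y == 0]
mu, var = X0.mean(axis=0), X0.var(axis=0)

def gaussian_pdf(x, mu, var):
    # Per-feature normal density: the likelihood GaussianNB assumes
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Log-likelihood of one sample under the class-0 model
# (per-feature densities are summed in log space, i.e. the features
# are assumed conditionally independent given the class)
sample = X[0]
print(np.log(gaussian_pdf(sample, mu, var)).sum())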

Dataset 3: SMS Spam Collection Dataset


[3]: import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])
data.head()

[3]:   label                                            message
     0   ham  Go until jurong point, crazy.. Available only …
     1   ham                          Ok lar… Joking wif u oni…
     2  spam  Free entry in 2 a wkly comp to win FA Cup fina…
     3   ham    U dun say so early hor… U c already then say…
     4   ham  Nah I don't think he goes to usf, he lives aro…

The SMS Spam Collection dataset is a text classification problem. For such data, Multinomial Naive
Bayes is suitable because it works well with discrete features like word counts or term frequencies.
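As a quick illustration of what "term frequencies" means here (a hedged sketch; the toy messages
are invented), CountVectorizer turns raw text into the word-count matrix that MultinomialNB
consumes:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only to show the transformation
docs = ["free prize now", "call me now", "free free entry"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary (scikit-learn >= 1.0)
print(X.toarray())  # one row per message, one count per vocabulary word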

Analysis and Justification

Let's implement and analyze each dataset with the appropriate Naive Bayes algorithm.

Mushroom Dataset with Multinomial Naive Bayes


[5]: import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
column_names = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape',
    'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring',
    'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color',
    'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'
]
data = pd.read_csv(url, header=None, names=column_names)
print(data.head())  # Check if the data is loaded correctly

# Encode categorical features (each column gets its own integer codes)
le = LabelEncoder()
encoded_data = data.apply(le.fit_transform)

# Verify the columns after encoding
print(encoded_data.columns)

# Split the data
X = encoded_data.drop('class', axis=1)
y = encoded_data['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

  class cap-shape cap-surface cap-color bruises odor gill-attachment  \
0     p         x           s         n       t    p               f
1     e         x           s         y       t    a               f
2     e         b           s         w       t    l               f
3     p         x           y         w       t    p               f
4     e         x           s         g       f    n               f

  gill-spacing gill-size gill-color  … stalk-surface-below-ring  \
0            c         n          k  …                        s
1            c         b          k  …                        s
2            c         b          n  …                        s
3            c         n          n  …                        s
4            w         b          k  …                        s

  stalk-color-above-ring stalk-color-below-ring veil-type veil-color  \
0                      w                      w         p          w
1                      w                      w         p          w
2                      w                      w         p          w
3                      w                      w         p          w
4                      w                      w         p          w

  ring-number ring-type spore-print-color population habitat
0           o         p                 k          s       u
1           o         p                 n          n       g
2           o         p                 n          n       m
3           o         p                 k          s       u
4           o         e                 n          a       g

[5 rows x 23 columns]
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],
      dtype='object')
Accuracy: 0.8073846153846154
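Because label-encoded categories are not true counts, an alternative worth trying is CategoricalNB,
which models each integer-coded feature as an unordered category. This is a sketch, not part of the
original lab; it assumes scikit-learn >= 0.22 and reuses X_train, X_test, y_train, y_test from above:

from sklearn.naive_bayes import CategoricalNB

# CategoricalNB treats each integer code as an unordered category, which
# matches how LabelEncoder produced these features. It assumes every code
# seen at predict time also appeared in the training split.
cat_model = CategoricalNB()
cat_model.fit(X_train, y_train)
print(f"CategoricalNB accuracy: {accuracy_score(y_test, cat_model.predict(X_test))}")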

Iris Dataset with Gaussian Naive Bayes


[8]: from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy

[8]: 1.0
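A perfect score on a 30-sample test split can be optimistic. As a sanity check (not in the original
lab), cross-validation over the full Iris data gives a more robust estimate:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validation; the exact scores depend on the fold assignment
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")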

SMS Spam Collection Dataset with Multinomial Naive Bayes

[10]: import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Convert text data to term frequency vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['message'])
y = data['label'].map({'ham': 0, 'spam': 1})

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9856502242152466
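To classify unseen text, new messages must go through the already-fitted vectorizer (transform, not
fit_transform). A minimal sketch with invented example messages:

# Hypothetical new messages, reusing the fitted vectorizer and model
new_messages = ["WINNER! Claim your free prize now", "Are we still on for lunch?"]
X_new = vectorizer.transform(new_messages)
print(model.predict(X_new))  # 1 = spam, 0 = ham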

1.1.2 Conclusion
Mushroom Dataset: Multinomial Naive Bayes was applied because the data is categorical with more
than two values per feature. The observed accuracy (about 81%) is moderate rather than high,
because label encoding treats arbitrary category codes as counts; CategoricalNB (sketched above)
models such features more faithfully.

Iris Dataset: Gaussian Naive Bayes is appropriate because the features are continuous and
approximately normally distributed within each class. Under this assumption the model classified
the held-out test set perfectly (accuracy 1.0).

SMS Spam Collection Dataset: Multinomial Naive Bayes is suitable because the data is text-based
and the model pairs naturally with term-frequency features. It achieved about 98.6% accuracy on
the test set.
