
Assignment 03


Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA

ASSIGNMENT: 03
Aim: Perform Naïve Bayes classification and Linear Regression, each on its own dataset.

Requirements:
• Software: PyCharm Professional
• Libraries: PySpark Module; the code in this document uses pandas, scikit-learn, matplotlib, and seaborn
• Dataset: Salary_dataset.csv and drug200.csv from Kaggle

Theory: Naive Bayes classification is a probabilistic algorithm based on Bayes' theorem that assumes the
predictors are conditionally independent given the class. It computes the posterior probability of each class
given the input features and predicts the class with the highest probability. The Gaussian variant used in this
implementation models each feature within a class as normally distributed, so it expects numeric inputs: the
categorical columns of the drug dataset (Sex, BP, Cholesterol, Drug) are converted to integer codes with label
encoding, and the features are standardized. The model's performance is evaluated using accuracy, per-class
precision and recall, and the confusion matrix.
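To make the "highest posterior probability" rule concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier in plain Python. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names (`fit_gaussian_nb`, `log_posterior`, `predict`) and the six toy rows are made up for this example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalised log P(class | x): log prior plus summed log Gaussian likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def predict(model, x):
    """Pick the class with the highest posterior score."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Toy rows (hypothetical): [Age, Na_to_K] -> drug label
X = [[23, 25.4], [61, 18.0], [30, 20.1], [47, 13.1], [47, 10.1], [56, 11.6]]
y = ["DrugY", "DrugY", "DrugY", "drugC", "drugC", "drugC"]

model = fit_gaussian_nb(X, y)
print(predict(model, [40, 22.0]))
print(predict(model, [50, 11.0]))
```

The independence assumption shows up in `log_posterior`: each feature contributes its own term to the sum, with no interaction between features.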

Linear regression is a statistical method used to model the relationship between a dependent variable (e.g., salary)
and one or more independent variables (e.g., years of experience). The technique assumes a linear relationship,
meaning the change in the dependent variable is proportional to changes in the independent variables. The model is
expressed in the form Y = a + bX, where Y is the predicted value, a is the intercept, b is the slope, and X is the
independent variable. The model is fitted by choosing a and b to minimize the sum of squared errors on the training
data, and it is evaluated with metrics such as Mean Absolute Error (MAE) and the R² score.
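For a single independent variable, the least-squares values of a and b have a closed form: b is the covariance of X and Y divided by the variance of X, and a = mean(Y) - b * mean(X). The sketch below computes them in plain Python. The function name `fit_line` is hypothetical, and the five sample points are rows copied from the salary table in this document.

```python
def fit_line(xs, ys):
    """Return (a, b) minimising the sum of squared errors of Y = a + bX."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = y_mean - b * x_mean
    return a, b

# Five (YearsExperience, Salary) rows from the salary dataset
years = [1.2, 3.0, 5.0, 8.0, 10.4]
salary = [39344.0, 56643.0, 67939.0, 101303.0, 122392.0]

a, b = fit_line(years, salary)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"predicted salary at 6.0 years = {a + b * 6.0:.2f}")
```

On these five rows alone the coefficients will differ from those of a model fitted on the full training split; the point is only to show what "fitting" computes.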

Code for Naïve Bayes:


# Perform Naive Bayes Classification on Drug Dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# load dataset
dataset = pd.read_csv('drug200.csv')
dataset

     Age Sex      BP Cholesterol  Na_to_K   Drug
0     23   F    HIGH        HIGH   25.355  DrugY
1     47   M     LOW        HIGH   13.093  drugC
2     47   M     LOW        HIGH   10.114  drugC
3     28   F  NORMAL        HIGH    7.798  drugX
4     61   F     LOW        HIGH   18.043  DrugY
..   ...  ..     ...         ...      ...    ...
195   56   F     LOW        HIGH   11.567  drugC
196   16   M     LOW        HIGH   12.006  drugC
197   52   M  NORMAL        HIGH    9.894  drugX
198   23   M  NORMAL      NORMAL   14.020  drugX
199   40   F     LOW      NORMAL   11.349  drugX

[200 rows x 6 columns]



# Transform Categorical Data using Label Encoding


from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
dataset['Sex'] = labelencoder.fit_transform(dataset['Sex'])
dataset['BP'] = labelencoder.fit_transform(dataset['BP'])
dataset['Cholesterol'] = labelencoder.fit_transform(dataset['Cholesterol'])
dataset['Drug'] = labelencoder.fit_transform(dataset['Drug'])
dataset

     Age  Sex  BP  Cholesterol  Na_to_K  Drug
0     23    0   0            0   25.355     0
1     47    1   1            0   13.093     3
2     47    1   1            0   10.114     3
3     28    0   2            0    7.798     4
4     61    0   1            0   18.043     0
..   ...  ...  ..          ...      ...   ...
195   56    0   1            0   11.567     3
196   16    1   1            0   12.006     3
197   52    1   2            0    9.894     4
198   23    1   2            1   14.020     4
199   40    0   1            1   11.349     4

[200 rows x 6 columns]
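The integer codes in the table follow from how LabelEncoder numbers its classes: the unique labels are sorted and each one receives its position in that order. The sketch below mimics that rule in plain Python rather than calling scikit-learn. `label_codes` is a hypothetical helper, and the labels "drugA" and "drugB" are assumed from the Kaggle dataset, since they do not appear in the rows printed above.

```python
def label_codes(values):
    """Map each distinct label to its rank in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}

bp_codes = label_codes(["HIGH", "LOW", "NORMAL", "LOW"])
# "drugA" and "drugB" are assumed labels, not shown in the printed rows
drug_codes = label_codes(["DrugY", "drugA", "drugB", "drugC", "drugX"])

print(bp_codes)    # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
print(drug_codes)  # {'DrugY': 0, 'drugA': 1, 'drugB': 2, 'drugC': 3, 'drugX': 4}

# Inverting the mapping turns predicted codes back into drug names
decode = {code: label for label, code in drug_codes.items()}
print([decode[c] for c in [3, 4, 0]])  # ['drugC', 'drugX', 'DrugY']
```

"DrugY" receives code 0 because uppercase letters sort before lowercase ones. Keeping a separate mapping (or a separate encoder object) per column is what makes this decoding possible; reusing a single encoder for every column retains only the mapping of the last column fitted.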

# Splitting the dataset into the Training set and Test set
X = dataset.iloc[:, 0:5].values
y = dataset.iloc[:, 5].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,
                                                    random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
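The scaler is fitted on the training split only (`fit_transform`) and then reused unchanged on the test split (`transform`), so no information from the test data leaks into preprocessing. The sketch below reproduces that idea for a single column in plain Python. `fit_scaler` and `transform` are hypothetical helpers, and the ages are a few values from the drug table above. Like StandardScaler, it uses the population standard deviation.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardise values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in column]

train_age = [23, 47, 47, 28, 61]   # stands in for the training split
test_age = [56, 16]                # stands in for the test split

mu, sigma = fit_scaler(train_age)               # learned from training data only
train_scaled = transform(train_age, mu, sigma)  # mean 0, standard deviation 1
test_scaled = transform(test_age, mu, sigma)    # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The scaled test values generally do not have mean 0 and standard deviation 1, which is expected: they are expressed in units of the training distribution.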

# Training the Gaussian Naive Bayes model on the Training set
nvc = GaussianNB()
nvc.fit(X_train, y_train)

GaussianNB()

y_pred = nvc.predict(X_test)
y_pred

array([3, 4, 3, 0, 0, 4, 4, 4, 3, 4, 1, 0, 0, 0, 2, 3, 0, 0, 4, 1, 1, 4,
4, 4, 0, 0, 0, 0, 0, 4, 4, 3, 1, 4, 0, 0, 4, 3, 1, 4, 0, 1, 0, 0,
0, 4, 4, 0, 1, 2])

# Making the Confusion Matrix


from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)

<Axes: >

# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

0.84

# Precision
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred, average=None))

[0.94736842 0.71428571 0.5 0.5 0.9375 ]

# Recall
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred, average=None))

[0.72 1. 1. 1. 0.9375]
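Accuracy, precision, and recall can all be read off the confusion matrix. With scikit-learn's convention (rows are true classes, columns are predicted classes), precision for class k divides the diagonal entry by its column sum, and recall divides it by its row sum. The sketch below shows the arithmetic in plain Python. `per_class_metrics` is a hypothetical helper and the 3x3 matrix is made up, since the actual matrix appears only as a heatmap in this document.

```python
def per_class_metrics(cm):
    """Return (accuracy, precision list, recall list) from a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precision, recall = [], []
    for k in range(n):
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision.append(cm[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(cm[k][k] / actual_k if actual_k else 0.0)
    return correct / total, precision, recall

# Hypothetical confusion matrix for three classes
cm = [
    [5, 1, 0],
    [0, 3, 1],
    [0, 0, 2],
]

accuracy, precision, recall = per_class_metrics(cm)
print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
```

A class can have perfect recall and poor precision at the same time, as classes 2 and 3 do in the output above: every true member is found, but half of the samples assigned to the class belong elsewhere.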

# Perform visualization
import matplotlib.pyplot as plt
dataset = pd.read_csv('drug200.csv')
plt.bar(dataset['Age'], dataset['BP'])
plt.show()

plt.bar(dataset['Age'], dataset['Cholesterol'])
plt.show()

Code for Linear Regression:


# Perform Linear Regression on Salary Dataset

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

dataset = pd.read_csv('Salary_dataset.csv')
dataset

    Unnamed: 0  YearsExperience    Salary
0            0              1.2   39344.0
1            1              1.4   46206.0
2            2              1.6   37732.0
3            3              2.1   43526.0
4            4              2.3   39892.0
5            5              3.0   56643.0
6            6              3.1   60151.0
7            7              3.3   54446.0
8            8              3.3   64446.0
9            9              3.8   57190.0
10          10              4.0   63219.0
11          11              4.1   55795.0
12          12              4.1   56958.0
13          13              4.2   57082.0
14          14              4.6   61112.0
15          15              5.0   67939.0
16          16              5.2   66030.0
17          17              5.4   83089.0
18          18              6.0   81364.0
19          19              6.1   93941.0
20          20              6.9   91739.0
21          21              7.2   98274.0
22          22              8.0  101303.0
23          23              8.3  113813.0
24          24              8.8  109432.0
25          25              9.1  105583.0
26          26              9.6  116970.0
27          27              9.7  112636.0
28          28             10.4  122392.0
29          29             10.6  121873.0

# Perform Visualization
plt.bar(dataset['YearsExperience'], dataset['Salary'])

<BarContainer object of 30 artists>

plt.scatter(dataset['YearsExperience'], dataset['Salary'])

<matplotlib.collections.PathCollection at 0x25c02ae3700>

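# Select the feature (YearsExperience) and the target (Salary);
# the 'Unnamed: 0' index column is left out
X = dataset[['YearsExperience']].values
y = dataset['Salary'].values

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3,
                                                    random_state = 0)

# Fitting Simple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()

# Predicting the Test set results
y_pred = regressor.predict(X_test)

# Performance Metrics
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
# regressor.score returns the R² score (coefficient of determination), not an accuracy
print('R² Score:', regressor.score(X_test, y_test))

Mean Absolute Error: 1.7541523789077474e-15
Mean Squared Error: 4.067564042545842e-30
Accuracy: 1.0

Note: the output above was recorded with X = dataset.iloc[:, :-1] and y = dataset.iloc[:, 1], so the target
(YearsExperience) was also one of the features and the fit is exact to floating-point precision. The value
labelled Accuracy is the R² score. These figures do not measure salary prediction; with Salary as the target,
as selected above, the errors are nonzero.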
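The three quantities printed above can be computed by hand, which also clarifies what the last one means: `regressor.score` returns the R² score, 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below is plain Python. `regression_metrics` is a hypothetical helper and the three salary values are made up to keep the arithmetic easy.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R²) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    y_mean = sum(y_true) / n
    ss_res = sum(e * e for e in errors)              # unexplained variation
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

# Hypothetical true salaries and predictions
y_true = [40000.0, 60000.0, 80000.0]
y_pred = [42000.0, 59000.0, 77000.0]

mae, mse, r2 = regression_metrics(y_true, y_pred)
print("MAE:", mae)
print("MSE:", mse)
print("R²:", r2)

# Predictions identical to the targets give zero error and R² = 1.0
print(regression_metrics(y_true, y_true))
```

An R² of exactly 1.0 together with errors around 1e-15 means the predictions match the targets up to floating-point rounding, which on real data usually signals that the target has leaked into the features.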

Conclusion: Gaussian Naive Bayes, trained on the label-encoded and standardized drug dataset, reached an accuracy
of 0.84 on the 50-sample test set. Per-class results were uneven: precision was 0.5 for classes 2 and 3 and
recall was 0.72 for class 0, while recall was 1.0 for classes 1, 2, and 3. Naive Bayes is therefore a reasonable
baseline for this classification task, with room for improvement through other algorithms and hyperparameter
tuning.
Linear regression provides a straightforward way to quantify the relationship between variables, such as
estimating salary from years of experience. The model is evaluated with Mean Absolute Error (MAE), Mean Squared
Error (MSE), and the R² score. The recorded MAE of about 1.8e-15 and R² of 1.0 arise because the target column
was also among the features in that run, so they should not be read as the model's accuracy on salary.
