Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA
ASSIGNMENT: 03
Aim: Perform Naïve Bayes classification and Linear Regression, each on its own dataset.
Requirements:
• Software: PyCharm Professional
• Libraries: pandas, scikit-learn, matplotlib
• Dataset: Salary_dataset.csv and drug200.csv from Kaggle
Theory: Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem. It assumes that the
predictors are conditionally independent of one another given the class. It is simple, fast to train, and often
applied to categorical data, which makes it suitable for tasks like drug classification in medical datasets. The
algorithm computes the posterior probability of each class given the input features and selects the class with
the highest probability as the predicted output. In this implementation, categorical variables are converted to
integers using label encoding, and feature scaling is applied so that all features are on a comparable scale
before a Gaussian Naive Bayes model is fitted. The model's performance is evaluated using accuracy, precision,
recall, and the confusion matrix.
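To make the "highest posterior probability" rule concrete, here is a minimal from-scratch sketch of a categorical
Naive Bayes classifier. The toy rows, the labels, and the function names train_nb and predict_nb are invented for
illustration and are not taken from drug200.csv. Laplace smoothing with alpha = 1 is an added assumption to avoid
zero probabilities.

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels, alpha=1.0):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, label)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values, alpha

def predict_nb(model, row):
    """Return the class with the highest log posterior for one row."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for i, value in enumerate(row):
            # Laplace-smoothed conditional probability P(value | label)
            numerator = value_counts[(i, label)][value] + alpha
            denominator = count + alpha * len(feature_values[i])
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: (blood pressure, cholesterol) -> drug
rows = [("HIGH", "HIGH"), ("HIGH", "NORMAL"), ("LOW", "HIGH"),
        ("LOW", "NORMAL"), ("NORMAL", "HIGH"), ("HIGH", "HIGH")]
labels = ["drugA", "drugA", "drugC", "drugX", "drugX", "drugA"]

model = train_nb(rows, labels)
prediction = predict_nb(model, ("HIGH", "HIGH"))
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.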
Linear regression is a statistical method used to model the relationship between a dependent variable (e.g., salary)
and one or more independent variables (e.g., years of experience). The technique assumes a linear relationship,
meaning the change in the dependent variable is proportional to the change in the independent variables. With a
single independent variable, the model is expressed as Y = a + bX, where Y is the predicted value, a is the
intercept, b is the slope, and X is the independent variable. The model is trained by choosing a and b to minimize
the sum of squared errors on the training data, and it is evaluated using metrics such as Mean Absolute Error (MAE)
and the R² score.
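As a small illustration of how the intercept a and slope b can be obtained, the sketch below applies the
closed-form least-squares formulas for a single independent variable. The data points and the function name
fit_simple_linear_regression are made up for this example and do not come from the salary dataset.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for Y = a + bX that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # intercept a makes the line pass through the point of means
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data lying exactly on Y = 30000 + 9000 * X
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [39000.0, 48000.0, 57000.0, 66000.0, 75000.0]

a, b = fit_simple_linear_regression(years, salary)
predicted_salary = a + b * 6.0  # prediction for 6 years of experience
print(f"intercept a = {a:.1f}, slope b = {b:.1f}, prediction = {predicted_salary:.1f}")
```

Because the toy points lie exactly on a line, the fit recovers that line; real data would leave non-zero residuals.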
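import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
# load dataset
dataset = pd.read_csv('drug200.csv')
dataset
# Label-encode every non-numeric column (including the target) so GaussianNB receives numbers.
# This loop is one possible way to do the label encoding described in the Theory section.
for col in dataset.columns:
    if not pd.api.types.is_numeric_dtype(dataset[col]):
        dataset[col] = LabelEncoder().fit_transform(dataset[col])
# Separate the five feature columns from the target column
X = dataset.iloc[:, 0:5].values
y = dataset.iloc[:, 5].values
# Splitting the dataset into the Training set and Test set
# (test_size=0.25 matches the 50 predictions shown below; random_state=0 is an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)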
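The Theory section mentions label encoding of categorical variables. As a rough sketch of what that
transformation does, the snippet below maps each distinct category to an integer in sorted order, which is also
how scikit-learn's LabelEncoder numbers its classes. The sample values and the helper names fit_label_encoder and
encode are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer, assigned in sorted order."""
    classes = sorted(set(values))
    return {value: index for index, value in enumerate(classes)}

def encode(values, mapping):
    """Replace every value by its integer code."""
    return [mapping[v] for v in values]

# Hypothetical blood-pressure column
bp_demo = ["HIGH", "LOW", "NORMAL", "HIGH", "LOW"]
bp_mapping = fit_label_encoder(bp_demo)
bp_encoded = encode(bp_demo, bp_mapping)
print(bp_mapping)
print(bp_encoded)
```

The integer codes carry no real ordering, so models that treat them as magnitudes should be used with care.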
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
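The scaling step above fits the scaler on the training data only and then reuses the same statistics for the test
data. A minimal sketch of that idea for a single column follows. The numbers and the helper names
fit_standardizer and standardize are invented for illustration, and the population standard deviation is used,
as StandardScaler does.

```python
import math

def fit_standardizer(train_column):
    """Compute the mean and population standard deviation of the training values."""
    n = len(train_column)
    mean = sum(train_column) / n
    variance = sum((v - mean) ** 2 for v in train_column) / n
    return mean, math.sqrt(variance)

def standardize(column, mean, std):
    """Apply z = (v - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in column]

# Hypothetical age values
train_age = [20.0, 30.0, 40.0, 50.0]
test_age = [35.0, 60.0]

mean, std = fit_standardizer(train_age)        # learned from training data only
train_scaled = standardize(train_age, mean, std)
test_scaled = standardize(test_age, mean, std)  # test data reuses the training statistics
print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation keeps information about the test set from leaking into the model.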
nvc = GaussianNB()
nvc.fit(X_train, y_train)
GaussianNB()
y_pred = nvc.predict(X_test)
y_pred
array([3, 4, 3, 0, 0, 4, 4, 4, 3, 4, 1, 0, 0, 0, 2, 3, 0, 0, 4, 1, 1, 4,
4, 4, 0, 0, 0, 0, 0, 4, 4, 3, 1, 4, 0, 0, 4, 3, 1, 4, 0, 1, 0, 0,
0, 4, 4, 0, 1, 2])
# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.84
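Accuracy alone does not show which classes are confused with each other, which is what the confusion matrix
named in the Theory section is for. A minimal sketch of how such a matrix can be built is given below. It uses
made-up label lists rather than this report's y_test and y_pred, and the names confusion_matrix_demo,
y_true_demo and y_pred_demo are hypothetical.

```python
def confusion_matrix_demo(y_true, y_pred, n_classes):
    """matrix[i][j] = number of samples with true class i that were predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical true and predicted labels for three classes
y_true_demo = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix_demo(y_true_demo, y_pred_demo, n_classes=3)
for row in cm:
    print(row)

# Accuracy is the sum of the diagonal divided by the number of samples
accuracy_demo = sum(cm[i][i] for i in range(3)) / len(y_true_demo)
print(accuracy_demo)
```

Off-diagonal entries show which pairs of classes the model mixes up.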
# Precision
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred, average=None))
# Recall
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred, average=None))
[0.72 1. 1. 1. 0.9375]
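With average=None, precision_score and recall_score return one value per class instead of a single averaged
number. As a rough sketch of what those per-class values mean, the snippet below computes them by hand on
invented labels (not this report's y_test and y_pred). The function name per_class_precision_recall is
hypothetical.

```python
def per_class_precision_recall(y_true, y_pred, classes):
    """Return {class: (precision, recall)} computed one class at a time."""
    results = {}
    for c in classes:
        true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted_as_c = sum(1 for p in y_pred if p == c)
        actually_c = sum(1 for t in y_true if t == c)
        # precision: of the samples predicted as c, how many really are c
        precision = true_positives / predicted_as_c if predicted_as_c else 0.0
        # recall: of the samples that really are c, how many were found
        recall = true_positives / actually_c if actually_c else 0.0
        results[c] = (precision, recall)
    return results

# Hypothetical true and predicted labels for three classes
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 2, 0, 2]

scores = per_class_precision_recall(true_labels, pred_labels, classes=[0, 1, 2])
for c, (precision, recall) in scores.items():
    print(c, round(precision, 3), round(recall, 3))
```

A class can have perfect recall yet imperfect precision (class 1 here), which is why both metrics are reported.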
# Perform visualization
import matplotlib.pyplot as plt
dataset = pd.read_csv('drug200.csv')
plt.bar(dataset['Age'], dataset['BP'])
plt.show()
plt.bar(dataset['Age'], dataset['Cholesterol'])
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
dataset = pd.read_csv('Salary_dataset.csv')
dataset
# Output (truncated to the last two rows; the last two columns are YearsExperience and Salary):
28 28 10.4 122392.0
29 29 10.6 121873.0
# Perform Visualization
plt.bar(dataset['YearsExperience'], dataset['Salary'])
plt.show()
plt.scatter(dataset['YearsExperience'], dataset['Salary'])
plt.xlabel('YearsExperience')
plt.ylabel('Salary')
plt.show()
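# Separate the feature (years of experience) from the target (salary)
X = dataset[['YearsExperience']].values
y = dataset['Salary'].values
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3,
                                                    random_state = 0)
# Fit the model and predict on the test set (the variable name regressor is an assumption)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
# Performance Metrics
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))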
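The report names both Mean Absolute Error and the R² score as evaluation metrics. As a hedged illustration of
what the two numbers measure, the sketch below computes them by hand on made-up values (not this report's y_test
and y_pred). The function names mae_manual and r2_manual are hypothetical. In scikit-learn the corresponding
functions are mean_absolute_error and r2_score.

```python
def mae_manual(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical actual and predicted salaries
actual_demo = [40000.0, 50000.0, 60000.0, 70000.0]
predicted_demo = [42000.0, 49000.0, 61000.0, 68000.0]

mae_value = mae_manual(actual_demo, predicted_demo)
r2_value = r2_manual(actual_demo, predicted_demo)
print('Mean Absolute Error:', mae_value)
print('R2 Score:', r2_value)
```

MAE is expressed in the units of the target (here, salary), while R² is unitless and equals 1 for a perfect fit.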
Conclusion: The Gaussian Naive Bayes classifier trained on the drug dataset predicts drug categories from patient
attributes with an accuracy of 0.84 on the test set. Categorical columns were label-encoded, features were
scaled, and the model was evaluated using accuracy, precision, and recall. The results suggest that Naive Bayes
is a reasonable baseline for classification tasks on medical datasets of this kind, with room for improvement
through other algorithms and hyperparameter tuning.
Linear regression is a fundamental tool in predictive analytics that provides a straightforward way to quantify
relationships between variables. By fitting a linear model to the data, we can make predictions such as
estimating salary from years of experience. Metrics like Mean Absolute Error (MAE) and the R² score measure how
closely the predictions match held-out data. This approach helps in understanding trends within the data and
supports decision-making across various fields, which makes it a valuable tool for data analysis and
interpretation.