
Data Science Record - 05

The document outlines experiments for data exploration, preprocessing, linear regression, logistic regression, and Naive Bayes classification using Python. It includes algorithms, code implementations, and results for each experiment, demonstrating data handling and model evaluation techniques. The experiments aim to provide practical applications of machine learning methods on datasets.

Uploaded by

Deepak Sathis

EXP.NO: 01

DATE: 23.01.2025 PERFORM DATA EXPLORATION AND PREPROCESSING

AIM:
To write a Python program that performs data exploration and preprocessing on the
uploaded dataset.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary Python libraries
Step 3: Load the dataset from the current file directory
Step 4: Perform data exploration and data preprocessing for the loaded dataset
Step 5: Display the output
Step 6: Stop the program
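Steps 3 and 4 can be tried without the CSV file. The following is a minimal sketch on a small in-memory table; the values are invented for illustration, and only two of the preprocessing steps are shown: median imputation and the 1.5 x IQR outlier rule.

```python
import pandas as pd

# Hypothetical toy data standing in for the uploaded dataset
df = pd.DataFrame({
    "Age": [25, None, 40, 35, 30],
    "Salary": [30000, 32000, 31000, 29000, 500000],
})

# Fill the missing age with the column median
df["Age"] = df["Age"].fillna(df["Age"].median())

# Keep only salaries inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
mask = df["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[mask]

print(cleaned)
```

In this toy table the salary of 500000 falls outside the IQR bounds and is dropped, while the missing age is replaced by the median of the remaining ages.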
CODE:
import pandas as pd

# Show every row and column when printing
pd.set_option("display.max_rows", None, "display.max_columns", None, "display.width", None)

file_path = '/content/traffic_accidects.csv'
df = pd.read_csv(file_path)

# Data exploration
print("First few rows of the dataset:")
print(df.head())
print("\nSummary Statistics:")
print(df.describe(include="all"))
print("\nMissing Values:")
print(df.isnull().sum())

# Impute missing numeric values
if 'Age' in df.columns:
    df['Age'] = df['Age'].fillna(df['Age'].median())
if 'Salary' in df.columns:
    df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Parse dates; missing or unparseable entries become NaT
if 'AccidentDate' in df.columns:
    df['AccidentDate'] = pd.to_datetime(df['AccidentDate'], errors='coerce')

# Encode gender as 0/1
if 'Gender' in df.columns:
    df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Drop rows that have no severity score
if 'SeverityScore' in df.columns:
    df = df.dropna(subset=['SeverityScore']).copy()

# Derive the number of years elapsed since each accident
if 'AccidentDate' in df.columns:
    current_year = pd.Timestamp.now().year
    df['YearsSinceAccident'] = current_year - df['AccidentDate'].dt.year

# Remove salary outliers with the 1.5 * IQR rule
if 'Salary' in df.columns:
    Q1 = df['Salary'].quantile(0.25)
    Q3 = df['Salary'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df['Salary'] >= lower_bound) & (df['Salary'] <= upper_bound)]

print("\nCleaned Dataset:")
print(df.head().to_string())
OUTPUT:
Particulars             Marks Allotted    Marks Awarded
Program / Simulation    40
Program Execution       30
Result                  20
Viva Voce               10
Total                   100

RESULT:
Thus, a program for data exploration and preprocessing has been successfully
executed.

EXP.NO: 02 (a)

DATE: 30.01.2025 Implement linear and logistic regression

1). Linear regression:


a). Single linear regression:
AIM:
To write a Python program that implements single linear regression, fitting a
straight line to the data points.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary Python libraries
Step 3: Load the dataset and implement single linear regression using its formula
Step 4: Display the output
Step 5: Stop the program
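The listing in this section fits regularised Ridge and Lasso models on many features. For the single-variable case named in the aim, the least-squares formula from Step 3 can be sketched directly. This is a minimal illustration on invented data; the function name `fit_line` is hypothetical and not part of the record.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line for the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

Because the sample points lie exactly on a line, the recovered slope and intercept match the generating values; with noisy data the same formula gives the line that minimises the sum of squared residuals.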
CODE:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
file_path = "/content/datasets/house_prices.csv"
df = pd.read_csv(file_path)
df = df.drop(columns=['id', 'date'])
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].median())
df['bathrooms'] = df['bathrooms'].fillna(df['bathrooms'].median())
df['sqft_living'] = df['sqft_living'].fillna(df['sqft_living'].median())
df['sqft_lot'] = df['sqft_lot'].fillna(df['sqft_lot'].median())
df['waterfront'] = df['waterfront'].fillna(df['waterfront'].mode()[0])
df['view'] = df['view'].fillna(df['view'].mode()[0])
df['condition'] = df['condition'].fillna(df['condition'].mode()[0])
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])
df['age_of_house'] = 2025 - df['yr_built']
df['time_since_renovation'] = 2025 - df['yr_renovated']
df['time_since_renovation'] = df['time_since_renovation'].where(df['yr_renovated'] != 0,
0)
df['total_sqft'] = df['sqft_living'] + df['sqft_basement']
df['bedrooms_sqft'] = df['bedrooms'] * df['sqft_living']
df = pd.get_dummies(df, columns=['waterfront', 'view', 'condition', 'zipcode'],
drop_first=True)
X = df.drop(columns=['price'])
y = df['price']
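X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Fit the scaler on the training split only, so test statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)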
ridge_model = Ridge(alpha=1.0)
lasso_model = Lasso(alpha=0.1)
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)
y_pred_lasso = lasso_model.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)
print("Ridge Regression Model Evaluation:")
print(f"MSE: {mse_ridge:.2f}")
print(f"RMSE: {rmse_ridge:.2f}")
print(f"R-squared: {r2_ridge:.2f}")
print("\nLasso Regression Model Evaluation:")
print(f"MSE: {mse_lasso:.2f}")
print(f"RMSE: {rmse_lasso:.2f}")
print(f"R-squared: {r2_lasso:.2f}")
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_ridge, color='blue', alpha=0.6, label="Ridge")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.title('Actual vs Predicted Housing Prices (Ridge Regression)', fontsize=16)
plt.xlabel('Actual Housing Price', fontsize=14)
plt.ylabel('Predicted Housing Price', fontsize=14)
plt.legend()
plt.grid()
plt.show()
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_lasso, color='green', alpha=0.6, label="Lasso")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.title('Actual vs Predicted Housing Prices (Lasso Regression)', fontsize=16)
plt.xlabel('Actual Housing Price', fontsize=14)
plt.ylabel('Predicted Housing Price', fontsize=14)
plt.legend()
plt.grid()
plt.show()

OUTPUT:

b). Multi linear regression:


AIM:
To write a Python program that implements multiple linear regression, fitting a
linear function of several input features to the data points.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary Python libraries
Step 3: Load the dataset and implement multiple linear regression using its formula
Step 4: Display the output
Step 5: Stop the program
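The listing in this section trains a logistic classifier on a price threshold. For the multiple linear regression named in the aim, the least-squares fit from Step 3 can be sketched with NumPy. The data below are invented and noise-free, and the variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data generated from y = 3*x1 - 2*x2 + 5 (no noise)
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 3 * X[:, 0] - 2 * X[:, 1] + 5

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ~ y
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

print("Intercept:", round(float(intercept), 4))
print("Coefficients:", np.round(coefs, 4))
```

`np.linalg.lstsq` is used here instead of inverting the normal-equation matrix explicitly because it stays numerically stable when features are strongly correlated.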
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
file_path = "/content/house_prices.csv"
df = pd.read_csv(file_path)
price_threshold = 500000
df['price_above_threshold'] = (df['price'] > price_threshold).astype(int)
categorical_columns = ['waterfront', 'view', 'condition', 'grade', 'zipcode']
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15',
'sqft_lot15'] + [col for col in df.columns if
col.startswith(tuple(categorical_columns))]
X = df[features]
y = df['price_above_threshold']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=["Below",
"Above"], yticklabels=["Below", "Above"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f'Confusion Matrix (Accuracy: {accuracy:.2f})')
plt.show()
coefficients = model.coef_[0]
intercept = model.intercept_[0]
coeff_df = pd.DataFrame({'Feature': features, 'Coefficient':
coefficients}).sort_values(by='Coefficient', ascending=False)
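# Print the results explicitly so they also appear outside a notebook cell
print(f"Accuracy: {accuracy:.2f}")
print("Confusion matrix:")
print(conf_matrix)
print("Top 10 coefficients:")
print(coeff_df.head(10))
print(f"Intercept: {intercept:.4f}")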

OUTPUT:
EXP.NO: 02 (b)
DATE: 30.01.2025 Implement linear and logistic regression

2). Logistic regression:


AIM:
To write a Python program that implements logistic regression, fitting a sigmoid
curve to the data points.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary Python libraries
Step 3: Load the dataset and implement logistic regression using its formula
Step 4: Display the output
Step 5: Stop the program
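The formula referred to in Step 3 can be written out as follows. This is the standard sigmoid together with the batch gradient-descent update for the mean log-loss; the symbols (m training samples, learning rate α, parameter vector θ) are notation chosen here rather than taken from the record.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} = \sigma(X\theta), \qquad
\theta \leftarrow \theta - \frac{\alpha}{m}\, X^{\top}\bigl(\sigma(X\theta) - y\bigr)
```

A sample is assigned to the positive class when the predicted probability is at least 0.5, which is equivalent to the linear score for that sample being non-negative.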

CODE:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
iris = load_iris()
X = iris.data[:, 0].reshape(-1, 1)
y = (iris.target == 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, theta, lr=0.01, iters=1000):
    # Batch gradient descent on the mean log-loss
    for _ in range(iters):
        theta -= lr * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta
theta = np.zeros(X_train_poly.shape[1])
theta_optimal = gradient_descent(X_train_poly, y_train, theta)
predictions = sigmoid(X_test_poly @ theta_optimal) >= 0.5
accuracy = np.mean(predictions == y_test)
print(f"Accuracy: {accuracy * 100:.2f}%")
x_values = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
x_poly = poly.transform(x_values)
y_values = sigmoid(x_poly @ theta_optimal) >= 0.5
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(x_values, y_values, color='red', label='Decision Boundary')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Setosa (1) vs Not Setosa (0)')
plt.title('Logistic Regression with Curved Decision Boundary')
plt.legend()
plt.grid(True)
plt.show()

OUTPUT:
Particulars             Marks Allotted    Marks Awarded
Program / Simulation    40
Program Execution       30
Result                  20
Viva Voce               10
Total                   100

RESULT:
Thus, a program for linear and logistic regression has been successfully executed.

EXP.NO: 03

DATE: 06.02.2025 Naive Bayes classifier

AIM:
To write a Python code for the implementation of the Naive Bayes Classifier for classifying
data based on probability distributions.
ALGORITHM:
Step 1: Start the program
Step 2: Import the necessary Python libraries
Step 3: Load the dataset and preprocess the data
Step 4: Compute the prior probabilities and likelihood using Bayes' theorem
Step 5: Build the Naïve Bayes classifier and train it on the dataset
Step 6: Use the trained model to make predictions
Step 7: Display the output
Step 8: Stop the program
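Steps 4 to 6 can be followed by hand on a tiny example. The sketch below assumes a single numeric feature and two invented classes; it scores each class by log prior plus Gaussian log-likelihood and picks the larger. The names `log_gaussian` and `predict` are hypothetical helpers for this illustration.

```python
import math


def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


# Hypothetical one-feature training data for two classes
data = {"A": [1.0, 2.0, 3.0], "B": [7.0, 8.0, 9.0]}
total = sum(len(values) for values in data.values())


def predict(x):
    """Return the class with the highest log posterior for the value x."""
    scores = {}
    for label, values in data.items():
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        prior = len(values) / total
        # Log posterior up to a constant: log prior + log likelihood
        scores[label] = math.log(prior) + log_gaussian(x, mean, var)
    return max(scores, key=scores.get)


print(predict(2.5), predict(7.5))
```

Computing the log-density directly, rather than taking the logarithm of a density value, avoids taking the log of zero when a sample lies far from a class mean.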
CODE:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
class NaiveBayesClassifier:
    """Gaussian Naive Bayes classifier implemented from scratch."""

    def __init__(self):
        self.class_priors = {}
        self.means = {}
        self.variances = {}
        self.classes = None

    def fit(self, X, y):
        # Estimate the prior and the per-feature mean and variance of each class
        self.classes = np.unique(y)
        for c in self.classes:
            X_c = X[y == c]
            self.class_priors[c] = len(X_c) / len(X)
            self.means[c] = np.mean(X_c, axis=0)
            # Small constant keeps the variance strictly positive
            self.variances[c] = np.var(X_c, axis=0) + 1e-9

    def gaussian_pdf(self, x, mean, variance):
        coeff = 1 / np.sqrt(2 * np.pi * variance)
        exponent = np.exp(-((x - mean) ** 2) / (2 * variance))
        return coeff * exponent

    def predict(self, X):
        predictions = []
        for x in X:
            posteriors = {}
            for c in self.classes:
                # Log posterior up to a constant: log prior + sum of log likelihoods
                prior = np.log(self.class_priors[c])
                likelihood = np.sum(np.log(self.gaussian_pdf(x, self.means[c], self.variances[c])))
                posteriors[c] = prior + likelihood
            predictions.append(max(posteriors, key=posteriors.get))
        return np.array(predictions)
df = pd.read_csv('/content/house_prices2.csv')
# Encode text columns as integer codes
for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = pd.factorize(df[col])[0]
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
nb = NaiveBayesClassifier()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=np.unique(y),
yticklabels=np.unique(y))
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix for Naïve Bayes Classifier")
plt.show()
test_sizes = np.linspace(0.1, 0.5, 5)
accuracies = []
# Retrain and evaluate for each test-set fraction
for test_size in test_sizes:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
                                                        random_state=42)
    nb.fit(X_train, y_train)
    y_pred = nb.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))
plt.figure(figsize=(7, 5))
plt.plot(test_sizes, accuracies, marker='o', linestyle='-', color='m', label="Naïve Bayes Accuracy")
plt.xlabel("Test Size")
plt.ylabel("Accuracy")
plt.title("Naïve Bayes Accuracy vs. Test Size")
plt.legend()
plt.grid()
plt.show()
OUTPUT:
Particulars             Marks Allotted    Marks Awarded
Program / Simulation    40
Program Execution       30
Result                  20
Viva Voce               10
Total                   100

RESULT:

Thus, the required Naïve Bayes model has been executed successfully.

EXP.NO: 04

DATE: 13.03.2025 POWER BI

AIM:
To make an analytical dashboard for E-commerce.

ALGORITHM:
STEP 1: Load Data - Import dataset into Power BI using Get Data and load it into a table.
STEP 2: Preprocess Data - Handle missing values, encode Gender, compute Experience.
STEP 3: Define Variables - Create binary target variable SalaryAbove50K, set X and y.
STEP 4: Split Data - Divide features and target into training and testing sets.
STEP 5: Train Model - Train Naïve Bayes model using Power BI AI Insights.
STEP 6: Evaluate and Visualize - Compute accuracy, generate confusion matrix, plot ROC curve.

OUTPUT:
MARK ALLOCATION:

Particulars             Marks Allotted    Marks Awarded
Program / Simulation    40
Program Execution       30
Result                  20
Viva Voce               10
Total                   100

RESULT:
Thus, the Zomato sales dataset has been successfully visualized using a Power BI
dashboard.
