National University of Technology (NUTECH)

Electrical Engineering Department


EE4407-Machine Learning Lab

LAB No: 08

NAME: Muhmmad Ahmed Mustafa

ID NO: F20603040

Lab 08: Linear Regression and Data Preprocessing on Supermarket Sales Dataset

Objective:

• Understand and implement data preprocessing techniques, including log transformation and one-hot encoding.
• Apply Exploratory Data Analysis (EDA) to gain insights into the dataset.
• Explore linear regression for predicting sales in the supermarket dataset.

Tools/Software Requirements:

• Python 3.x
• Jupyter Notebook or any other Python IDE
• Pandas, NumPy for data manipulation
• Matplotlib, Seaborn for data visualization
• Scikit-learn for linear regression modeling
• Sample supermarket sales dataset (cleaned) in CSV format

Data Preprocessing and Feature Engineering


Import Necessary Libraries and Load and Inspect the Dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Replace with your own file path
file_path = 'C:\\Users\\us\\Desktop\\cleaned\\superstore_final_dataset_cleaned.csv'
# ISO-8859-1 and cp1252 cover most scenarios.
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')

# Inspect the dataset: first rows, column types, and non-null counts
print(dataset.head())
dataset.info()
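If the encoding of a CSV file is not known in advance, one option is to try a few encodings in order. The sketch below is only an illustration: the helper name read_csv_with_fallback and the order of encodings are assumptions, not part of the lab handout. ISO-8859-1 is placed last because it maps every byte to a character and therefore never raises a decoding error.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "ISO-8859-1")):
    """Hypothetical helper: try each encoding in turn.

    Returns a tuple (DataFrame, encoding_used).
    """
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            # This encoding cannot decode the file; remember the error
            # and move on to the next candidate.
            last_error = err
    # Only reached if every candidate encoding failed.
    raise last_error


# Example usage (file_path is whatever path you pass to pd.read_csv):
# dataset, used_encoding = read_csv_with_fallback(file_path)
```

A file that decodes without error is not guaranteed to be decoded correctly, so it is still worth checking a few text columns by eye after loading.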

Necessary data preprocessing before feeding the dataset to the model:

# Converting to datetime (unparseable values become NaT)
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'],
                                       errors='coerce', dayfirst=True)
dataset['Ship_Date'] = pd.to_datetime(dataset['Ship_Date'],
                                      errors='coerce', dayfirst=True)

# Extracting year, month, and day from Order_Date
dataset['Order_Year'] = dataset['Order_Date'].dt.year
dataset['Order_Month'] = dataset['Order_Date'].dt.month
dataset['Order_Day'] = dataset['Order_Date'].dt.day

# Extracting year, month, and day from Ship_Date
dataset['Ship_Year'] = dataset['Ship_Date'].dt.year
dataset['Ship_Month'] = dataset['Ship_Date'].dt.month
dataset['Ship_Day'] = dataset['Ship_Date'].dt.day

# Dropping the original datetime columns
dataset.drop(['Order_Date', 'Ship_Date'], axis=1, inplace=True)

# Dropping unnecessary columns
columns_to_drop = ['Row_ID', 'Order_ID', 'Product_ID', 'Customer_Name',
                   'Customer_ID', 'Product_Name']
dataset.drop(columns=columns_to_drop, inplace=True)

Feature Engineering:
1. Create new features like 'Date_Gap' and apply one-hot encoding to categorical variables.
2. Apply a logarithmic transformation to the 'Sales' variable to address skewness.
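The handout names a 'Date_Gap' feature, but none of its code computes one. A minimal sketch follows, assuming 'Date_Gap' means the number of days between the order date and the ship date (this definition is an assumption). The three rows are made-up toy values, not rows from the lab dataset. In the lab pipeline this step would have to run before the Order_Date and Ship_Date columns are dropped.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
orders = pd.DataFrame({
    "Order_Date": ["08/11/2017", "12/06/2017", "11/10/2016"],
    "Ship_Date": ["11/11/2017", "16/06/2017", "18/10/2016"],
})

# Parse day-first date strings, as in the preprocessing step
for col in ["Order_Date", "Ship_Date"]:
    orders[col] = pd.to_datetime(orders[col], dayfirst=True, errors="coerce")

# Assumed definition: shipping delay in whole days
orders["Date_Gap"] = (orders["Ship_Date"] - orders["Order_Date"]).dt.days

print(orders)
```

Rows where either date failed to parse would end up with a missing 'Date_Gap', so they need to be filled or dropped before model fitting.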

# Encoding categorical variables
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_cols = ['Segment', 'Country', 'City', 'State', 'Category']

# scikit-learn >= 1.2 uses sparse_output (older releases used sparse=False)
encoder = OneHotEncoder(sparse_output=False)
encoded_categorical_data = encoder.fit_transform(dataset[categorical_cols])

# get_feature_names_out replaces get_feature_names (removed in scikit-learn 1.2)
encoded_categorical_df = pd.DataFrame(
    encoded_categorical_data,
    columns=encoder.get_feature_names_out(categorical_cols),
    index=dataset.index)

# Merging encoded data with the original dataset and dropping original
# categorical columns
dataset = dataset.join(encoded_categorical_df).drop(categorical_cols, axis=1)

# Apply logarithmic transformation to the 'Sales' variable.
# np.log1p(x) computes log(1 + x); adding 1 avoids log(0), which is undefined.
dataset['Log_Sales'] = np.log1p(dataset['Sales'])

# Display the first few rows of the updated dataset
print(dataset[['Sales', 'Log_Sales']].head())

# Visualization of the 'Sales' distribution and Log Transformed Sales Distribution
# (sns.histplot is used because sns.distplot is deprecated in seaborn)
plt.figure(figsize=(12, 6))

# Original Sales Distribution
plt.subplot(1, 2, 1)
sns.histplot(dataset['Sales'], kde=True, bins=30)
plt.title('Original Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')

# Log Transformed Sales Distribution
plt.subplot(1, 2, 2)
sns.histplot(dataset['Log_Sales'], kde=True, bins=30)
plt.title('Log Transformed Sales Distribution')
plt.xlabel('Log(Sales)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
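The histograms show the effect of the log transformation visually. The skewness coefficient gives a number for the same comparison. The sketch below uses synthetic right-skewed values as a stand-in for 'Sales' (an assumption made so the example is self-contained), so the figures it prints say nothing about the lab dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values standing in for 'Sales' (not the real data)
sales = pd.Series(rng.lognormal(mean=4.0, sigma=1.2, size=5000), name="Sales")
log_sales = np.log1p(sales)  # log(1 + x), as in the transformation step

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = sales.skew()
skew_after = log_sales.skew()

print(f"Skewness before log transform: {skew_before:.2f}")
print(f"Skewness after log transform:  {skew_after:.2f}")
```

On the lab data the same check would be dataset['Sales'].skew() against dataset['Log_Sales'].skew().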

Linear Regression Modeling


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Splitting the data into features (X) and target (y).
# 'Log_Sales' is derived from 'Sales', so it is dropped from the features
# to avoid leaking the target into the model.
X = dataset.drop(['Sales', 'Log_Sales'], axis=1)  # Features
y = dataset['Sales']  # Target

# Splitting the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Training a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predicting on the test set
y_pred = lin_reg.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)

Lab Task
1. Perform data preprocessing on the supermarket sales dataset, including feature
engineering and log transformation.
2. Conduct EDA to understand the relationships in the dataset.
3. Build and evaluate a linear regression model to predict sales.
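For the EDA task, a possible starting point is sketched below: summary statistics, a per-category aggregate, and a correlation matrix. The DataFrame is a made-up toy table whose column names follow the handout ('Date_Gap' is named in the feature engineering list); its values are not from the lab dataset.

```python
import pandas as pd

# Toy data: column names follow the handout, values are made up
toy = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture",
                 "Office Supplies", "Technology"],
    "Segment": ["Consumer", "Corporate", "Consumer",
                "Home Office", "Consumer"],
    "Sales": [200.0, 950.0, 120.0, 30.0, 400.0],
    "Date_Gap": [3, 4, 7, 2, 5],
})

# Summary statistics for the target
summary = toy["Sales"].describe()

# How sales vary across a categorical feature
by_category = toy.groupby("Category")["Sales"].agg(["count", "mean", "median"])

# Pairwise linear correlation between numeric columns
corr = toy[["Sales", "Date_Gap"]].corr()

print(summary)
print(by_category)
print(corr)
```

On the real data, sns.heatmap(corr, annot=True) would be one way to visualize the correlation matrix.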

CONCLUSION

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Features (X) and Target Variable (y).
# The target is the log-transformed sales; both 'Sales' and 'Log_Sales'
# are removed from the features.
X = dataset.drop(['Sales', 'Log_Sales'], axis=1)
y = dataset['Log_Sales']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predicting on the test set
y_pred = lin_reg.predict(X_test)

# Evaluating the model.
# y is Log_Sales, so both metrics are measured on the log scale.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R² Score:", r2)
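Because this model is trained on the log-transformed target, its MSE is in squared log units and cannot be read directly as an error in sales. One way to report the error in the original units is to invert the transformation before scoring. The sketch below uses synthetic data (an assumption made so the example is self-contained) and np.expm1, the inverse of log(1 + x).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: one feature and a skewed, positive 'sales' target
X = rng.normal(size=(500, 1))
sales = np.exp(1.0 + 0.8 * X[:, 0] + rng.normal(scale=0.3, size=500))
y_log = np.log1p(sales)  # same transformation as Log_Sales

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred_log = model.predict(X_test)

# Invert log(1 + x) to get back to the original sales units
pred_sales = np.expm1(pred_log)
true_sales = np.expm1(y_test)

mse_log = mean_squared_error(y_test, pred_log)
rmse_sales = np.sqrt(mean_squared_error(true_sales, pred_sales))

print("MSE on log scale:", mse_log)
print("RMSE in original units:", rmse_sales)
```

The two numbers answer different questions: the log-scale MSE reflects relative errors, while the back-transformed RMSE is dominated by the largest sales values.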