Lab08 ML
LAB No: 08
ID NO: F20603040
Lab 08: Linear Regression and Data Preprocessing on Supermarket Sales Dataset
Objective:
Understand and implement data preprocessing techniques, including log transformation and one-hot encoding.
Apply Exploratory Data Analysis (EDA) to gain insights into the dataset.
Explore linear regression for predicting sales in the supermarket dataset.
Tools/Software Requirements:
Python 3.x
Jupyter Notebook or any other Python IDE
Pandas, NumPy for data manipulation
Matplotlib, Seaborn for data visualization
Scikit-learn for linear regression modeling
Sample supermarket sales dataset (cleaned) in CSV format
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
file_path = 'C:\\Users\\us\\Desktop\\cleaned\\superstore_final_dataset_cleaned.csv'  # Replace with your own file path
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')
# ISO-8859-1 and cp1252 cover most scenarios.
National University of Technology (NUTECH)
Electrical Engineering Department
EE4407-Machine Learning Lab
# Converting to datetime
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'],
                                       errors='coerce', dayfirst=True)
dataset['Ship_Date'] = pd.to_datetime(dataset['Ship_Date'],
                                      errors='coerce', dayfirst=True)
Feature Engineering:
1. Create new features like 'Date_Gap' and apply one-hot encoding to categorical variables.
2. Apply a logarithmic transformation to the 'Sales' variable to address skewness.
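The two steps above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame, not the lab dataset. It assumes that 'Date_Gap' means the number of days between 'Order_Date' and 'Ship_Date', and it uses np.log1p (that is, log(1 + x)) so that a zero sales value would not break the transformation. The column name 'Log_Sales' is hypothetical.

```python
import numpy as np
import pandas as pd

# Small made-up sample standing in for the supermarket dataset
sample = pd.DataFrame({
    'Order_Date': ['08/11/2017', '12/06/2017', '11/10/2016'],
    'Ship_Date': ['11/11/2017', '16/06/2017', '18/10/2016'],
    'Sales': [261.96, 14.62, 957.58],
})

# Parse the day-first date strings, as done for the full dataset
for col in ['Order_Date', 'Ship_Date']:
    sample[col] = pd.to_datetime(sample[col], errors='coerce', dayfirst=True)

# Date_Gap: days between order and shipment (assumed definition)
sample['Date_Gap'] = (sample['Ship_Date'] - sample['Order_Date']).dt.days

# Log transformation of 'Sales' to reduce right skew ('Log_Sales' is a hypothetical name)
sample['Log_Sales'] = np.log1p(sample['Sales'])

print(sample[['Date_Gap', 'Sales', 'Log_Sales']])
```

np.expm1 reverses np.log1p, so predictions made on the log scale can be mapped back to sales units later.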
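# One-hot encoding the categorical columns.
# Assumed setup: the handout does not show how 'encoder' was created.
categorical_cols = ['Segment', 'Country', 'City', 'State', 'Category']
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array
encoded_categorical_data = encoder.fit_transform(dataset[categorical_cols]).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
encoded_categorical_df = pd.DataFrame(encoded_categorical_data,
                                      columns=encoder.get_feature_names_out(categorical_cols),
                                      index=dataset.index)
# Merging encoded data with the original dataset and dropping the original
# categorical columns
dataset = dataset.join(encoded_categorical_df).drop(categorical_cols, axis=1)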
plt.tight_layout()
plt.show()
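As one possible EDA step, the skewness of 'Sales' can be compared before and after the log transformation, both numerically and with histograms. The sketch below is an illustration only: it uses synthetic log-normal values in place of the real 'Sales' column, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic right-skewed values standing in for 'Sales' (not real data)
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=1000), name='Sales')
log_sales = np.log1p(sales)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skew_raw = sales.skew()
skew_log = log_sales.skew()
print(f"Skewness of Sales: {skew_raw:.2f}")
print(f"Skewness of log1p(Sales): {skew_log:.2f}")

# Side-by-side histograms of the raw and log-transformed values
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(sales, bins=30)
axes[0].set_title('Sales')
axes[1].hist(log_sales, bins=30)
axes[1].set_title('log1p(Sales)')
fig.tight_layout()
```

In a Jupyter Notebook the figure renders inline; in a plain script, call plt.show() afterwards to display it.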
# Splitting the data into training and test sets (80% train, 20% test).
# X (feature matrix) and y (target) must be defined before this step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Training a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
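Once the model is fitted, it can be evaluated on the held-out test set. The sketch below shows one way to do this with R² and RMSE from sklearn.metrics. It is a self-contained illustration on synthetic data: the features, coefficients and noise level are made up, and the target is assumed to be on the log scale, as it would be after applying np.log1p to 'Sales'.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and log-scale target (not the lab data)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 4.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.1, size=500)

# Same 80/20 split and model as in the lab code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Evaluate on the test set only, never on the training data
y_pred = lin_reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R^2: {r2:.3f}, RMSE (log scale): {rmse:.3f}")

# If the target was np.log1p(Sales), np.expm1 maps predictions back to sales units
sales_pred = np.expm1(y_pred)
```

An RMSE computed on the log scale is not in currency units, so it should be reported as such or recomputed after the np.expm1 back-transformation.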
Lab Task
1. Perform data preprocessing on the supermarket sales dataset, including feature
engineering and log transformation.
2. Conduct EDA to understand the relationships in the dataset.
3. Build and evaluate a linear regression model to predict sales.
CONCLUSION