
National University of Technology (NUTECH)

Electrical Engineering Department


EE4407-Machine Learning Lab

LAB No: 07

Name: Muhammad Ahmed Mustata

ID NO: F20603040

Lab 07: Data Preprocessing and Exploratory Data Analysis (EDA)

Objective:

 Familarize students how to standardize data, focusing on categorical data


standardization.
 Implement basic techniques of Exploratory Data Analysis (EDA).
 Use the supermarket sales dataset for practical application.

Tools/Software Requirements:

 Python 3.x
 Jupyter Notebook or any other Python IDE
 Pandas and NumPy libraries for data manipulation
 Matplotlib and Seaborn libraries for data visualization
 Sample supermarket sales dataset in CSV format

Data Standardization
Data standardization is about bringing consistency to your data, which is vital for
accurate analysis and interpretation. Specifically, we'll focus on standardizing categorical
data, a common challenge in real-world datasets.
Categorical data often contains variations in formatting, such as differing capitalizations
or typographical errors, which can lead to incorrect categorization. For instance, 'First
Class' and 'first class' might be treated as two distinct categories by data analysis tools,
even though they represent the same class of data. Such discrepancies can significantly
skew your analysis, leading to unreliable outcomes.
Our goal in this lab is to learn how to identify and correct these inconsistencies. We will
use Python and its powerful libraries to standardize categorical data, ensuring that our
dataset is clean, consistent, and ready for further analysis or machine learning tasks.
This step is essential in the data preprocessing phase and lays the foundation for any
subsequent data analysis or modeling.
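To see the problem concretely, here is a minimal sketch (using a made-up toy column, not the lab dataset) of how pandas counts formatting variants as separate categories:

```python
import pandas as pd

# A toy column with inconsistent formatting: pandas treats each
# capitalization variant as its own category.
ship_mode = pd.Series(['First Class', 'first class', 'FIRST CLASS', 'Second Class'])
print(ship_mode.value_counts())               # four "categories" instead of two

# After standardizing the case, the variants collapse together.
print(ship_mode.str.lower().value_counts())   # two categories
```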

Importance of Consistent Format for Categorical Data


Why Standardize?

Uniformity: Different formats (like 'First Class', 'first class', 'FIRST CLASS') are treated as
separate categories by most data analysis tools. Standardizing ensures all data in a
category is uniform, facilitating accurate analysis.
Error Reduction: Inconsistent data can lead to errors in analysis, such as skewed results
or incorrect interpretations.
Ease of Use: Standardized data is easier to work with, especially for sorting, filtering, and
applying machine learning algorithms.

Impact of Typos or Different Case Formatting


Consequences of Typos and Case Differences
Misleading Analysis: Typos or varying case formats can create artificial categories,
leading to incorrect conclusions or insights from the data.
Data Integrity: Typos can undermine the reliability of your dataset, making it appear less
trustworthy or accurate.
Efficiency: Cleaning and correcting these issues after the fact can be time-consuming.
It's more efficient to standardize as an early step in data preprocessing.
Standardization Strategies
Convert all text to a consistent case (lowercase or uppercase).
Use string matching or manual inspection to identify and correct common typos.
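Both strategies can be sketched in a few lines of pandas. The values below are illustrative toy data (including the deliberate typo 'second cilas'), not the lab dataset:

```python
import pandas as pd

# Hypothetical column illustrating the two strategies above.
modes = pd.Series(['First Class', 'first class', 'second cilas', 'Second Class'])

# 1. Convert all text to a consistent case (and strip stray whitespace).
modes = modes.str.strip().str.lower()

# 2. Correct known typos with an explicit mapping.
typo_map = {'second cilas': 'second class'}
modes = modes.replace(typo_map)

print(modes.unique())   # ['first class' 'second class']
```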

Exploratory Data Analysis (EDA)


Introduction to EDA
Exploratory Data Analysis, commonly known as EDA, is an approach to analyzing
datasets to summarize their main characteristics, often using visual methods. It is a
critical step in the data science process as it allows for a better understanding of the
data’s underlying structure, relationships, and potential patterns.
Purpose of EDA
 Discover Patterns: Uncover patterns, anomalies, or relationships that may not be
immediately apparent.
 Test Hypotheses: Formulate and test hypotheses about the drivers of the outcomes
you're studying.
 Check Assumptions: Validate assumptions about the data for further statistical
analyses and model building.
 Prepare for Advanced Analysis: Identify the most important variables and define the
strategy for complex analytical modeling.
EDA Techniques
 Univariate Analysis: Focuses on a single variable. Techniques include summary
statistics, frequency distributions, and visualization via histograms or box plots.
 Bivariate/Multivariate Analysis: Involves two or more variables to identify
relationships, using scatter plots, pair plots, and correlation analysis.

 Correlation Analysis: Determines the extent to which two variables are related using
a correlation coefficient.
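As a quick sketch of computing correlation coefficients, consider this small made-up numeric table (the column names echo the lab dataset but the numbers are invented for illustration):

```python
import pandas as pd

# Invented numeric data: Sales grows with Quantity, Discount shrinks.
df = pd.DataFrame({
    'Sales':    [100, 200, 300, 400],
    'Quantity': [1, 2, 3, 4],
    'Discount': [0.4, 0.3, 0.2, 0.1],
})

# Pearson correlation coefficients between every pair of numeric columns.
corr = df.corr()
print(corr)
# Sales and Quantity rise together (coefficient near 1.0), while Discount
# moves in the opposite direction (coefficient near -1.0).
```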

Visualization Tools
 Visual representations such as histograms, box plots, scatter plots, and heatmaps
are essential for EDA because they make it easier to identify outliers, understand
distributions, and observe relationships.
Conducting EDA with Python
We will use Python libraries such as Matplotlib and Seaborn, which offer extensive
capabilities for creating static, animated, and interactive visualizations, to gain insights
into our dataset.

Through EDA, we aim to understand the data better, identify any issues that need to be
addressed, and generate insights that can guide further analysis or predictive modeling.
It’s an exploratory phase where curiosity and creativity guide the discovery of trends and
patterns that will inform our subsequent analytical efforts.

Implementation
Import Necessary Libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load and Inspect the Dataset:


 Loading the dataset using pandas.

file_path = 'C:\\Users\\us\\Desktop\\superstore_final_dataset.csv'  # Replace with your file path
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')

 Basic inspection using .head(), .info()

print(dataset.head())
print(dataset.info())

1. pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.



2. head(): Returns the first 5 rows of the DataFrame, allowing you to quickly
check the format of your data and the first few entries.
3. info(): Provides a concise summary of your DataFrame, including the number
of non-null values in each column and the data type.

Data Standardization

 Use str.lower() or str.upper() to standardize the text format.


 Correct any known typos or inconsistencies.

In our dataset, we first need to check the unique values existing in a column. This gives us a
clear picture of how many categories are present in that column.
For example, we will check the unique values in our column Ship_Mode, correct any
existing typos, and then standardize to either lower- or uppercase.

# Checking for unique values in the 'Ship_Mode' column
print(dataset['Ship_Mode'].unique())

# Standardizing the 'Ship_Mode' column
dataset['Ship_Mode'] = dataset['Ship_Mode'].str.lower()

# Correcting typos, if any (example)
dataset['Ship_Mode'] = dataset['Ship_Mode'].replace('second cilas', 'second class')

Exploratory Data Analysis (EDA)

You will engage in hands-on practice with the supermarket sales dataset, applying various
EDA techniques to investigate sales trends, customer behavior, and product performance.
 Use .describe() to obtain a descriptive statistical summary of the dataset.

print(dataset.describe())

 Create histograms, box plots, bar plots, etc., to analyze the distribution of sales data.
Histogram
plt.hist(dataset['Sales'], bins=50, color='blue', edgecolor='black')
plt.title('Distribution of Sales')
plt.xlabel('Sales')
plt.ylabel('Number of Occurrences')
plt.show()

Bar plots
state_counts = dataset['State'].value_counts()
plt.figure(figsize=(10, 8))  # Increase the size of the plot for better readability
plt.bar(state_counts.index, state_counts.values, color='blue', edgecolor='black')
plt.title('Number of Occurrences by State')
plt.xlabel('State')
plt.ylabel('Number of Occurrences')
plt.xticks(rotation=90)  # Rotate the state names to prevent them from overlapping
plt.show()

Box Plots
plt.figure(figsize=(10, 6))
plt.boxplot(dataset['Sales'])
plt.title('Box Plot of Sales')
plt.ylabel('Sales')
plt.show()

Strip Plot
plt.figure(figsize=(10, 6))
sns.stripplot(x=dataset['Category'], y=dataset['Sales'])
plt.title('Sales by Category')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()

Lab Task
1. Do data standardization for columns J, K, M, N in the dataset provided.
2. Do EDA for the dataset. Consider different variables and how they relate and
contribute to the Sales column.

CONCLUSION
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

file_path = 'C:\\Users\\Student\\Desktop\\superstore_final_dataset (1).csv'
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')

print(dataset.head())

# Check if 'Row_ID' is unique
is_unique = dataset['Row_ID'].is_unique
print(is_unique)

# Reset index and update 'Row_ID'
dataset = dataset.reset_index(drop=True)
dataset['Row_ID'] = dataset.index + 1

# Check for missing values
missing_values = dataset.isnull().sum()
print(missing_values)

# Handle missing 'Order_ID'
for index, row in dataset.iterrows():
    if pd.isnull(row['Order_ID']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) &
                        dataset['Order_ID'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Order_ID'] = match['Order_ID'].values[0]

# Handle missing 'Customer_ID'
for index, row in dataset.iterrows():
    if pd.isnull(row['Customer_ID']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) &
                        dataset['Customer_ID'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Customer_ID'] = match['Customer_ID'].values[0]

# Handle missing 'Customer_Name'
for index, row in dataset.iterrows():
    if pd.isnull(row['Customer_Name']):
        customer_id = row['Customer_ID']
        match = dataset[(dataset['Customer_ID'] == customer_id) &
                        dataset['Customer_Name'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Customer_Name'] = match['Customer_Name'].values[0]

# Handle missing 'City'
for index, row in dataset.iterrows():
    if pd.isnull(row['City']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) &
                        dataset['City'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'City'] = match['City'].values[0]

# Handle missing 'Segment'
for index, row in dataset.iterrows():
    if pd.isnull(row['Segment']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) &
                        dataset['Segment'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Segment'] = match['Segment'].values[0]

# Handle missing 'Order_Date'
for index, row in dataset.iterrows():
    if pd.isnull(row['Order_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) &
                        dataset['Order_Date'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Order_Date'] = match['Order_Date'].values[0]
        else:
            # Set 'Order_Date' to be 3 days before 'Ship_Date'
            ship_date = pd.to_datetime(row['Ship_Date'])
            order_date = ship_date - pd.DateOffset(days=3)
            dataset.at[index, 'Order_Date'] = order_date

# Handle missing 'Ship_Date'
for index, row in dataset.iterrows():
    if pd.isnull(row['Ship_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) &
                        dataset['Ship_Date'].notnull()].head(1)
        if not match.empty:
            dataset.at[index, 'Ship_Date'] = match['Ship_Date'].values[0]
        else:
            # Set 'Ship_Date' to be 3 days after 'Order_Date'
            order_date = pd.to_datetime(row['Order_Date'])
            ship_date = order_date + pd.DateOffset(days=3)
            dataset.at[index, 'Ship_Date'] = ship_date

missing_values = dataset.isnull().sum()
print(missing_values)
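The per-column loops above all repeat the same lookup: fill a missing value from another row of the same customer. As an alternative sketch, that pattern can be written once with groupby/transform. The helper name fill_from_group and the toy rows below are illustrative, not part of the lab dataset, and the sketch assumes the key column itself has no missing values:

```python
import pandas as pd

def fill_from_group(df, key, column):
    """Fill NaNs in `column` using the first non-null value from rows
    that share the same `key` (assumes `key` has no missing entries)."""
    df[column] = df.groupby(key)[column].transform(
        lambda s: s.fillna(s.dropna().iloc[0]) if s.notna().any() else s
    )
    return df

# Toy example: a repeat customer with a missing city.
demo = pd.DataFrame({
    'Customer_Name': ['Ali', 'Ali', 'Sara'],
    'City': ['Lahore', None, 'Karachi'],
})
demo = fill_from_group(demo, 'Customer_Name', 'City')
print(demo['City'].tolist())   # ['Lahore', 'Lahore', 'Karachi']
```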

# Standardizing the text columns to Title Case
dataset['City'] = dataset['City'].str.lower().str.title()
dataset['State'] = dataset['State'].str.lower().str.title()
dataset['Category'] = dataset['Category'].str.lower().str.title()
dataset['Sub_Category'] = dataset['Sub_Category'].str.lower().str.title()

plt.hist(dataset['Sales'], bins=50, color='blue', edgecolor='black')
plt.title('Distribution of Sales')
plt.xlabel('Sales')
plt.ylabel('Number of Occurrences')
plt.show()


# Seasonal analysis: a separate histogram for each sub-category

# Assuming 'Order_Date' is in datetime format; if not, convert it
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'])

# Extract month from 'Order_Date'
dataset['Month'] = dataset['Order_Date'].dt.month

# Get unique sub-categories
unique_subcategories = dataset['Sub_Category'].unique()

# Create a separate histogram for each sub-category
for subcategory in unique_subcategories:
    # Filter dataset for the current sub-category
    subcategory_data = dataset[dataset['Sub_Category'] == subcategory]

    # Plot histogram for the current sub-category
    plt.figure(figsize=(8, 6))
    plt.hist(subcategory_data['Month'], bins=12, edgecolor='black', color='skyblue', alpha=0.7)
    plt.title(f'Seasonal Analysis for {subcategory}')
    plt.xlabel('Month')
    plt.ylabel('Number of Orders')
    plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
    plt.show()

# Assuming 'Order_Date' is in datetime format; if not, convert it
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'])

# Extract month from 'Order_Date'
dataset['Month'] = dataset['Order_Date'].dt.month

# Group data by month and sub-category, summing up sales for each combination
monthly_subcategory_sales = dataset.groupby(['Month', 'Sub_Category']).agg({'Sales': 'sum'}).reset_index()

# Pivot the data for plotting
pivot_data = monthly_subcategory_sales.pivot(index='Month', columns='Sub_Category', values='Sales')

# Plot stacked bar chart; figsize is passed to .plot() directly, since
# pandas' .plot() creates its own figure and would ignore a prior plt.figure()
sns.set_palette("husl")
pivot_data.plot(kind='bar', stacked=True, figsize=(15, 8))
plt.title('Monthly Sales Comparison by Sub-Category')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.xticks(rotation=0)
plt.legend(title='Sub-Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()