Lab07ML - f40
LAB No: 07
ID NO: F20603040
Objective:
To standardize categorical columns and perform exploratory data analysis (EDA) on a supermarket sales dataset using Python.
Tools/Software Requirements:
Python 3.x
Jupyter Notebook or any other Python IDE
Pandas and NumPy libraries for data manipulation
Matplotlib and Seaborn libraries for data visualization
Sample supermarket sales dataset in CSV format
Data Standardization
Data standardization is about bringing consistency to your data, which is vital for
accurate analysis and interpretation. Specifically, we'll focus on standardizing categorical
data, a common challenge in real-world datasets.
Categorical data often contains variations in formatting, such as differing capitalizations
or typographical errors, which can lead to incorrect categorization. For instance, 'First
Class' and 'first class' might be treated as two distinct categories by data analysis tools,
even though they represent the same class of data. Such discrepancies can significantly
skew your analysis, leading to unreliable outcomes.
Our goal in this lab is to learn how to identify and correct these inconsistencies. We will
use Python and its powerful libraries to standardize categorical data, ensuring that our
dataset is clean, consistent, and ready for further analysis or machine learning tasks.
This step is essential in the data preprocessing phase and lays the foundation for any
subsequent data analysis or modeling.
Uniformity: Different formats (like 'First Class', 'first class', 'FIRST CLASS') are treated as
separate categories by most data analysis tools. Standardizing ensures all data in a
category is uniform, facilitating accurate analysis; a short code sketch below illustrates this.
Error Reduction: Inconsistent data can lead to errors in analysis, such as skewed results
or incorrect interpretations.
Ease of Use: Standardized data is easier to work with, especially for sorting, filtering, and
applying machine learning algorithms.
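To make the uniformity point above concrete, here is a minimal sketch (using a small made-up Series rather than the lab dataset) of how inconsistent capitalization splits one category into several until the text is normalized:

import pandas as pd

# Hypothetical values; real data may contain other variants
ship_mode = pd.Series(['First Class', 'first class', 'FIRST CLASS', 'Second Class'])
print(ship_mode.value_counts())   # three separate 'First Class' variants are counted

# Normalize whitespace and capitalization so the variants collapse into one category
ship_mode = ship_mode.str.strip().str.title()
print(ship_mode.value_counts())   # 'First Class' now appears as a single category with count 3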
Exploratory Data Analysis (EDA)
EDA techniques summarize the main characteristics of a dataset, often with the help of statistics and visualizations. Key techniques include:
Correlation Analysis: Determines the extent to which two variables are related using
a correlation coefficient.
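As a quick illustration, the pandas corr() method computes these coefficients directly. A minimal sketch, assuming the supermarket sales data has already been loaded into a DataFrame named dataset; the Profit column used below is an assumption and may not exist in every version of the dataset:

# Pairwise Pearson correlation coefficients for all numeric columns
# (the numeric_only argument requires pandas 1.5 or newer)
print(dataset.corr(numeric_only=True))

# Correlation between two specific columns ('Profit' is assumed to exist)
print(dataset['Sales'].corr(dataset['Profit']))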
Visualization Tools
Visual representations such as histograms, box plots, scatter plots, and heatmaps
are essential for EDA because they make it easier to identify outliers, understand
distributions, and observe relationships.
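For instance, a correlation heatmap gives a compact view of all pairwise relationships at once. A minimal sketch, again assuming the data is already loaded into a DataFrame named dataset:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns, drawn as a colour-coded heatmap
corr = dataset.corr(numeric_only=True)   # numeric_only requires pandas 1.5+
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()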
Conducting EDA with Python
We will use Python libraries such as Matplotlib and Seaborn, which offer extensive
capabilities for creating static, animated, and interactive visualizations, to gain insights
into our dataset.
Through EDA, we aim to understand the data better, identify any issues that need to be
addressed, and generate insights that can guide further analysis or predictive modeling.
It’s an exploratory phase where curiosity and creativity guide the discovery of trends and
patterns that will inform our subsequent analytical efforts.
Implementation
Import Necessary Libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load and Inspect the Dataset:
# The file name below is illustrative; use the path of the supermarket sales CSV provided for the lab
dataset = pd.read_csv('supermarket_sales.csv')
print(dataset.head())
print(dataset.info())
1. read_csv(): Loads the CSV file into a pandas DataFrame so it can be inspected and
manipulated.
2. head(): Returns the first 5 rows of the DataFrame, allowing you to quickly
check the format of your data and the first few entries.
3. info(): Provides a concise summary of your DataFrame, including the number
of non-null values in each column and the data type.
Data Standardization
In our dataset, we first need to check the unique values existing in a column. This gives us a
clear picture of how many categories are present in that column.
For example, we will check the unique values in the Ship_Mode column, correct any
existing typos, and then standardize the text to either lower or upper case, as shown in the sketch below.
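A minimal sketch of that workflow follows; the specific typo correction is only an assumed example, so inspect the output of unique() to see which corrections your copy of the dataset actually needs:

# Inspect the raw categories present in Ship_Mode
print(dataset['Ship_Mode'].unique())

# Normalize whitespace and capitalization
dataset['Ship_Mode'] = dataset['Ship_Mode'].str.strip().str.title()

# Correct any typos spotted above ('Frist Class' is a hypothetical example)
dataset['Ship_Mode'] = dataset['Ship_Mode'].replace({'Frist Class': 'First Class'})

print(dataset['Ship_Mode'].unique())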
You will engage in hands-on practice with the supermarket sales dataset, applying various
EDA techniques to investigate sales trends, customer behavior, and product performance.
Use .describe() to obtain a descriptive statistical summary of the dataset.
print(dataset.describe())
Create histograms, box plots, bar plots, etc., to analyze the distribution of the sales data.
Histogram
plt.hist(dataset['Sales'], bins=50, color='blue', edgecolor='black')
plt.title('Distribution of Sales')
plt.xlabel('Sales')
plt.ylabel('Number of Occurrences')
plt.show()
Bar Plots
state_counts = dataset['State'].value_counts()
plt.figure(figsize=(10,8)) # This line is to increase the size of the plot for better readability
plt.bar(state_counts.index, state_counts.values, color='blue', edgecolor='black')
plt.title('Number of Occurrences by State')
plt.xlabel('State')
plt.ylabel('Number of Occurrences')
plt.xticks(rotation=90) # This rotates the state names to prevent them from overlapping
plt.show()
Box Plots
plt.figure(figsize=(10, 6))
plt.boxplot(dataset['Sales'])
plt.title('Box Plot of Sales')
plt.ylabel('Sales')
plt.show()
Strip Plot
plt.figure(figsize=(10, 6))
sns.stripplot(x=dataset['Category'], y=dataset['Sales'])
plt.title('Sales by Category')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()
Lab Task
1. Do data standardization for columns J, K, M, N in the dataset provided.
2. Do EDA for the dataset. Consider different variables and how they relate and
contribute to the Sales column.
CONCLUSION
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# The file name below is illustrative; use the path of the CSV provided for the lab task
dataset = pd.read_csv('supermarket_sales.csv')
print(dataset.head())

# Check whether Row_ID is unique; if not, rebuild it from the index
is_unique = dataset['Row_ID'].is_unique
print(is_unique)
dataset = dataset.reset_index(drop=True)
dataset['Row_ID'] = dataset.index + 1

# Count missing values in each column
missing_values = dataset.isnull().sum()
print(missing_values)
# Fill in missing values row by row. The matching rule used here is an assumption:
# a missing field is copied from another row that shares the same Customer_Name
# (or the same Customer_ID when the name itself is missing), and a missing date
# falls back to the row's other date column. Adjust this to the imputation rule
# your dataset actually requires.
for idx, row in dataset.iterrows():
    if pd.isnull(row['Order_ID']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Order_ID'].notnull()]
        if not match.empty:
            dataset.at[idx, 'Order_ID'] = match['Order_ID'].iloc[0]
    if pd.isnull(row['Customer_ID']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Customer_ID'].notnull()]
        if not match.empty:
            dataset.at[idx, 'Customer_ID'] = match['Customer_ID'].iloc[0]
    if pd.isnull(row['Customer_Name']):
        customer_id = row['Customer_ID']
        match = dataset[(dataset['Customer_ID'] == customer_id) & dataset['Customer_Name'].notnull()]
        if not match.empty:
            dataset.at[idx, 'Customer_Name'] = match['Customer_Name'].iloc[0]
    if pd.isnull(row['City']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['City'].notnull()]
        if not match.empty:
            dataset.at[idx, 'City'] = match['City'].iloc[0]
    if pd.isnull(row['Segment']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Segment'].notnull()]
        if not match.empty:
            dataset.at[idx, 'Segment'] = match['Segment'].iloc[0]
    if pd.isnull(row['Order_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Order_Date'].notnull()]
        if not match.empty:
            dataset.at[idx, 'Order_Date'] = match['Order_Date'].iloc[0]
        else:
            # Fall back to the ship date of the same row
            dataset.at[idx, 'Order_Date'] = pd.to_datetime(row['Ship_Date'])
    if pd.isnull(row['Ship_Date']):
        customer_name = row['Customer_Name']
        match = dataset[(dataset['Customer_Name'] == customer_name) & dataset['Ship_Date'].notnull()]
        if not match.empty:
            dataset.at[idx, 'Ship_Date'] = match['Ship_Date'].iloc[0]
        else:
            # Fall back to the order date of the same row
            dataset.at[idx, 'Ship_Date'] = pd.to_datetime(row['Order_Date'])
missing_values = dataset.isnull().sum()
print(missing_values)
#is_unique = dataset['City'].is_unique
#print(is_unique)
dataset['City'] = dataset['City'].str.lower().str.title()
dataset['State'] = dataset['State'].str.lower().str.title()
dataset['Category'] = dataset['Category'].str.lower().str.title()
dataset['Sub_Category'] = dataset['Sub_Category'].str.lower().str.title()
# Histogram of the Sales column
plt.hist(dataset['Sales'], bins=50, color='blue', edgecolor='black')
plt.title('Distribution of Sales')
plt.xlabel('Sales')
plt.ylabel('Number of Occurrences')
plt.show()
##import pandas as pd
#dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'])
#dataset['Month'] = dataset['Order_Date'].dt.month
#unique_subcategories = dataset['Sub_Category'].unique()
# plt.figure(figsize=(8, 6))
# plt.xlabel('Month')
# plt.ylabel('Number of Orders')
# plt.xticks(range(1, 13), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
# plt.show()
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'])
dataset['Month'] = dataset['Order_Date'].dt.month
# Group data by month and sub-category, summing up sales for each combination
pivot_data = dataset.groupby(['Month', 'Sub_Category'])['Sales'].sum().unstack()
sns.set_palette("husl")
# Stacked bar chart of monthly sales, split by sub-category
pivot_data.plot(kind='bar', stacked=True, figsize=(15, 8))
plt.title('Total Sales by Month and Sub-Category')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.xticks(rotation=0)
plt.show()