Solution

Step 1: Identify at least 10 major KPIs that would be useful for the business

Based on the dataset, I have identified the following 10 major KPIs that would be useful for the business (a short Pandas sketch after the list shows how a few of them can be computed):

 Sales Revenue: Total sales revenue generated by the supermarket chain

 Customer Count: Number of unique customers who have made purchases

 Average Order Value (AOV): Average amount spent by customers in a single transaction

 Customer Retention Rate: Percentage of customers who have made repeat purchases

 Product Category Sales: Sales revenue generated by each product category (e.g. dairy,
bakery, etc.)

 Top-Selling Products: Products that have generated the highest sales revenue

 Region-wise Sales: Sales revenue generated by each region (e.g. Chennai, Coimbatore, etc.)

 State-wise Sales: Sales revenue generated by each state (e.g. Tamil Nadu, Karnataka, etc.)

 Gross Margin: Difference between revenue and cost of goods sold

 Inventory Turnover: Number of times inventory is sold and replaced within a given period
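As a quick illustration, several of these KPIs can be computed in a few lines of Pandas. This is a minimal sketch assuming the dataset has 'Sales', 'Customer Name', 'Order ID', and 'Region' columns; the names should be adjusted to match the actual file.

import pandas as pd

df = pd.read_csv('Supermart Grocery Sales - Retail Analytics Dataset.csv')

# Sales Revenue: total revenue generated
sales_revenue = df['Sales'].sum()

# Customer Count: number of unique customers
customer_count = df['Customer Name'].nunique()

# Average Order Value: revenue per distinct order
aov = df.groupby('Order ID')['Sales'].sum().mean()

# Region-wise Sales: revenue broken down by region
region_sales = df.groupby('Region')['Sales'].sum().sort_values(ascending=False)

print(sales_revenue, customer_count, round(aov, 2))
print(region_sales)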

Step 2: Load the dataset and perform Data Preprocessing, Outlier Detection, and Exploratory Data
Analysis

To perform data preprocessing, outlier detection, and exploratory data analysis, I will use Python with the Pandas, NumPy, SciPy, and Matplotlib libraries.

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('Supermart Grocery Sales - Retail Analytics Dataset.csv')

# Data Preprocessing
# Check for missing values
print(df.isnull().sum())

# Handle missing values (impute numeric columns with the mean;
# plain df.mean() fails on non-numeric columns in recent Pandas)
df.fillna(df.mean(numeric_only=True), inplace=True)

# Outlier Detection
# Use the Z-score method on the numeric columns only
numeric_cols = df.select_dtypes(include=np.number)
z_scores = np.abs(stats.zscore(numeric_cols))
print(df[(z_scores > 3).any(axis=1)])  # rows with any |z| > 3

# Exploratory Data Analysis
# Summary statistics
print(df.describe())

# Visualize sales revenue by product category
# ('Sales' is assumed to be the revenue column)
df.groupby('Item Category')['Sales'].sum().plot(kind='bar')
plt.ylabel('Sales Revenue')
plt.show()

Output:

 Summary statistics of the dataset

 Bar chart showing the distribution of sales revenue by product category
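If the flagged rows should be removed rather than just inspected, one common convention (an assumption here, not part of the original solution) is to keep only rows whose numeric z-scores are all below 3:

# Drop rows where any numeric value lies more than 3 standard deviations out
numeric_cols = df.select_dtypes(include=np.number)
z_scores = np.abs(stats.zscore(numeric_cols))
df_clean = df[(z_scores < 3).all(axis=1)]
print('Removed', len(df) - len(df_clean), 'outlier rows')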

Step 3: Use the Association Rule Mining technique to identify the items frequently bought together and their demand

To perform association rule mining, I will use the Apriori algorithm implemented in the Python
library mlxtend.

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Convert the dataset to a transactional format: one basket of items
# per order ('Order ID' is assumed to be the transaction identifier)
transactions = df.groupby('Order ID')['Item Name'].apply(list).tolist()

# One-hot encode the baskets, since apriori expects a boolean DataFrame
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Perform association rule mining
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.5)

# Print the top 10 rules
print(rules.head(10))

Output:

 Top 10 association rules showing which items are frequently bought together and their demand (support and confidence)
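To surface the strongest relationships first, the rules can also be sorted by lift (how much more often the items co-occur than chance alone would predict), with support serving as a rough proxy for demand:

# Sort rules by lift so the strongest associations appear first
top_rules = rules.sort_values('lift', ascending=False)
print(top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10))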

Step 4: Use Classification techniques to develop a model and predict the item categories and sub-
categories that would provide the highest sales and profit region-wise/state-wise

To perform classification, I will use the Scikit-learn library in Python.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare the dataset for classification: drop the targets, then
# one-hot encode the remaining categorical columns, since random
# forests in scikit-learn require numeric inputs
X = pd.get_dummies(df.drop(['Item Category', 'Item Sub-Category'], axis=1))
y = df['Item Category']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = rfc.predict(X_test)

# Evaluate the model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

Output:

 Accuracy and classification report of the random forest classifier
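The classifier predicts item categories, but the business question also asks which categories and sub-categories drive the highest sales and profit region-wise and state-wise. A direct aggregation answers that; this sketch assumes the dataset has 'Region', 'State', 'Sales', and 'Profit' columns:

# Rank item categories by total sales and profit within each region
region_perf = (df.groupby(['Region', 'Item Category'])[['Sales', 'Profit']]
               .sum()
               .sort_values('Profit', ascending=False))
print(region_perf.groupby(level='Region').head(3))  # top 3 categories per region

# The same idea applies state-wise, at sub-category level if desired
state_perf = df.groupby(['State', 'Item Sub-Category'])[['Sales', 'Profit']].sum()
print(state_perf.sort_values('Sales', ascending=False).head(10))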

Step 5: Modify the dataset to incorporate the Non-Volatile feature of a data warehouse

Non-volatile means that once data enters the warehouse it is never overwritten or deleted; changes are appended as new records instead. To incorporate this, I will create a new column, Version, to track changes to the data.

# Create a new column 'Version' to track changes
df['Version'] = 1

# Save the modified dataset to a new CSV file
df.to_csv('Supermart Grocery Sales - Retail Analytics Dataset_Modified.csv', index=False)
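To make the non-volatile behavior concrete, later updates should be appended as new rows with an incremented version rather than overwriting existing records, so history is preserved. A minimal sketch (the changed_rows frame and the 'Order ID' key are hypothetical):

# Non-volatile update: never overwrite, always append a new version
def append_new_version(warehouse, changed_rows):
    changed_rows = changed_rows.copy()
    # Look up the latest version per record ('Order ID' is the assumed key)
    latest = warehouse.groupby('Order ID')['Version'].max()
    changed_rows['Version'] = changed_rows['Order ID'].map(latest).fillna(0) + 1
    # Append instead of updating in place, preserving all history
    return pd.concat([warehouse, changed_rows], ignore_index=True)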
