Supermarket Sales Data Analysis

Uploaded by

gutgit026
Copyright
© All Rights Reserved
Supermarket Sales Data Analysis

Here’s a beginner-level Python Pandas project idea to help you practice and strengthen your
skills:

Project: Analyzing Supermarket Sales Data


Objective

Perform basic data analysis on supermarket sales data to derive meaningful insights using
Pandas.

Steps and Instructions

1. Dataset

You can use a public dataset such as the "Supermarket Sales" dataset from Kaggle (linked below), or create your own dummy dataset in a CSV file.

https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales?resource=download

Sample Dataset Columns:

1. Invoice ID
2. Branch
3. Customer Type (e.g., Member, Normal)
4. Gender
5. Product Line
6. Unit Price
7. Quantity
8. Total
9. Date
10. Payment Method
11. Rating
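
If you go the dummy-dataset route, the columns listed above could be put together along these lines. This is only a sketch: every value, the four-row size, and the file name dummy_supermarket_sales.csv are made up for illustration, and Total is simply computed as Unit Price × Quantity.

```python
import pandas as pd

# Hypothetical records: every value below is invented for illustration.
dummy = pd.DataFrame({
    "Invoice ID": ["INV-001", "INV-002", "INV-003", "INV-004"],
    "Branch": ["A", "B", "A", "C"],
    "Customer Type": ["Member", "Normal", "Normal", "Member"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Product Line": ["Food", "Fashion", "Food", "Electronics"],
    "Unit Price": [10.0, 25.5, 8.0, 99.9],
    "Quantity": [3, 1, 5, 2],
    "Date": ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07"],
    "Payment Method": ["Cash", "Card", "Card", "Ewallet"],
    "Rating": [7.5, 8.0, 6.5, 9.0],
})

# Place Total right after Quantity so the column order matches the list above.
dummy.insert(7, "Total", dummy["Unit Price"] * dummy["Quantity"])

# Write the file without the row index (the file name is an assumption).
dummy.to_csv("dummy_supermarket_sales.csv", index=False)
print(dummy.head())
```
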

2. Tasks to Perform

1. Load the dataset:

○ Read the CSV file into a Pandas DataFrame.
○ Display the first 10 rows and understand the structure.

2. Data Exploration:

○ Check for missing values.
○ Understand data types and convert them if necessary (e.g., convert Date to datetime).
○ Generate summary statistics (mean, median, min, max, etc.).

3. Data Manipulation:

○ Add a new column: Compute "Total Sales" for each product (Unit Price × Quantity).
○ Filter data: Extract sales records for a specific branch or product line.

4. Data Aggregation:

○ Calculate the total sales per branch.
○ Group by Product Line and find the average rating for each product line.
○ Identify the branch with the highest sales.

5. Visualization (Optional):

Use Matplotlib or Seaborn to create simple plots like:

○ Sales distribution by branch.
○ Average rating by product line.
○ Total sales by payment method.

3. Code Structure

Here’s a basic outline to guide you:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset and look at the first 10 rows
data = pd.read_csv("supermarket_sales.csv")
print(data.head(10))

# Data Exploration
data.info()  # info() prints its summary itself and returns None
print(data.describe())
print(data.isnull().sum())

# Data Manipulation
# Column names must match the CSV header exactly. Check data.columns first:
# the Kaggle file may spell them differently (for example 'Unit price').
data['Total Sales'] = data['Unit Price'] * data['Quantity']

# Aggregation
total_sales_by_branch = data.groupby('Branch')['Total Sales'].sum()
average_rating_by_product_line = data.groupby('Product Line')['Rating'].mean()

# Visualizations
sns.barplot(x=total_sales_by_branch.index, y=total_sales_by_branch.values)
plt.title("Total Sales by Branch")
plt.show()
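
The outline leaves a few of the listed tasks open: converting Date to datetime, filtering by branch, and finding the branch with the highest sales. One possible way to handle them is sketched below. It uses a tiny invented DataFrame instead of the CSV so it runs on its own, and the names branch_a and top_branch are just illustrative.

```python
import pandas as pd

# Tiny invented sample so the sketch runs without the CSV file.
data = pd.DataFrame({
    "Branch": ["A", "B", "A", "C", "B"],
    "Product Line": ["Food", "Fashion", "Food", "Fashion", "Food"],
    "Unit Price": [10.0, 20.0, 5.0, 30.0, 8.0],
    "Quantity": [2, 1, 4, 3, 5],
    "Date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07", "2024-01-08"],
})

# Convert the Date column from text to datetime.
data["Date"] = pd.to_datetime(data["Date"])

data["Total Sales"] = data["Unit Price"] * data["Quantity"]

# Filter: keep only the records for one branch.
branch_a = data[data["Branch"] == "A"]

# Aggregate per branch, then pick the label of the largest total.
total_sales_by_branch = data.groupby("Branch")["Total Sales"].sum()
top_branch = total_sales_by_branch.idxmax()
print(top_branch, total_sales_by_branch.max())
```

Here idxmax() returns the index label (the branch name) of the largest value, while max() returns the value itself.
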

Expected Outcome

By the end of this project, you should be comfortable with:

● Loading and exploring data using Pandas.
● Performing basic data manipulations.
● Summarizing and aggregating data using Pandas functions.
● Optionally, visualizing data with Matplotlib or Seaborn.

Pandas Methods Reference List

Here’s a list of the Pandas methods and a few general Python functions you might use for the project:

1. Data Loading

● pd.read_csv(filepath) – Load a CSV file into a DataFrame.
● .head(n) – Display the first n rows of the DataFrame.
● .tail(n) – Display the last n rows.
● .sample(n) – Randomly sample n rows.

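
As a rough illustration of these loading calls, the sketch below reads a few invented CSV lines from memory (read_csv also accepts file-like objects), so no file is needed. The data and the random_state value are assumptions made for the example.

```python
import io
import pandas as pd

# Invented CSV text standing in for a file on disk.
csv_text = """Invoice ID,Branch,Quantity
INV-001,A,3
INV-002,B,1
INV-003,A,5
INV-004,C,2
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))                    # first two rows
print(df.tail(2))                    # last two rows
print(df.sample(2, random_state=0))  # two random rows; random_state makes the pick repeatable
```
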
2. Data Exploration

● .info() – Get a summary of the DataFrame structure, including column data types and
non-null counts.
● .shape – Get the dimensions of the DataFrame (rows, columns).
● .columns – Get the list of column names.
● .describe() – Generate summary statistics for numerical columns.
● .isnull() – Check for missing values (returns a boolean DataFrame).
● .isnull().sum() – Count missing values for each column.
● .dtypes – Get data types of each column.
● .unique() – Get unique values in a column.
● .value_counts() – Count occurrences of unique values in a column.
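
To see what these exploration calls return, you could try them on a small invented DataFrame like the one below. The missing Rating value is put in deliberately so that .isnull().sum() has something to count.

```python
import pandas as pd

# Invented sample with one missing rating.
df = pd.DataFrame({
    "Branch": ["A", "B", "A", "C"],
    "Rating": [7.5, None, 6.5, 9.0],
})

df.info()                           # prints column types and non-null counts
print(df.shape)                     # (rows, columns)
print(list(df.columns))             # column names
print(df.dtypes)                    # data type of each column
print(df.isnull().sum())            # missing values per column
print(df["Branch"].unique())        # distinct branch labels
print(df["Branch"].value_counts())  # how often each branch appears
print(df.describe())                # summary statistics for numeric columns
```
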

3. Data Manipulation

● Adding a new column:
df['New Column'] = df['Column1'] * df['Column2']

● Renaming columns:
.rename(columns={'Old Name': 'New Name'})

● Filtering rows:
○ Conditional filtering: df[df['Column'] > value]
○ Multiple conditions: df[(df['Column1'] > value1) & (df['Column2'] == 'condition')]

● Sorting:
.sort_values(by='Column', ascending=True)

● Resetting index:
.reset_index(drop=True)

● Dropping columns or rows:
.drop(columns=['Column1', 'Column2'])
.drop(index=[0, 1])

● Changing data types:
.astype({'Column': 'datatype'})
pd.to_datetime(df['Column']) – Convert a column to datetime (a top-level Pandas function, not a DataFrame method).
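
Here is one way these manipulation calls might be chained together. It is a sketch on invented data: the Quantity column is stored as text on purpose so that .astype() has something to fix, and the names filtered, ranked, and slim are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Branch": ["A", "B", "A", "C"],
    "Unit Price": [10.0, 20.0, 5.0, 30.0],
    "Quantity": ["2", "1", "6", "3"],   # stored as text on purpose
    "Date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})

# Fix data types: text to integers, text to datetime.
df = df.astype({"Quantity": "int64"})
df["Date"] = pd.to_datetime(df["Date"])

# Rename a column, then derive a new one from two existing columns.
df = df.rename(columns={"Unit Price": "Price"})
df["Total Sales"] = df["Price"] * df["Quantity"]

# Multiple conditions need parentheses around each comparison and & between them.
filtered = df[(df["Total Sales"] > 20) & (df["Branch"] == "A")]

# Sort from highest to lowest total and renumber the rows from 0.
ranked = df.sort_values(by="Total Sales", ascending=False).reset_index(drop=True)

# Drop a column that is no longer needed.
slim = df.drop(columns=["Date"])
print(ranked)
```

Most of these methods return a new DataFrame rather than changing df in place, which is why the results are assigned back or stored under a new name.
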

4. Data Aggregation and Grouping

● .groupby('Column') – Group data by one or more columns.

● Aggregation functions:
○ .sum() – Calculate the sum.
○ .mean() – Calculate the mean.
○ .median() – Calculate the median.
○ .min() and .max() – Get minimum and maximum values.
○ .count() – Count the number of non-null values.
○ .agg({'Col1': 'mean', 'Col2': 'sum'}) – Apply multiple aggregations.

● df.pivot_table(values='Column', index='Col1', columns='Col2', aggfunc='sum') – Create pivot tables.
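
The difference between .agg() and .pivot_table() is easier to see on data. The sketch below uses five invented rows: summary gets one row per branch, while pivot spreads the payment methods across columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "C"],
    "Payment Method": ["Cash", "Card", "Cash", "Cash", "Card"],
    "Total Sales": [20.0, 30.0, 10.0, 40.0, 90.0],
    "Rating": [7.0, 9.0, 6.0, 8.0, 5.0],
})

# One row per branch: summed sales and mean rating.
summary = df.groupby("Branch").agg({"Total Sales": "sum", "Rating": "mean"})
print(summary)

# Branches as rows, payment methods as columns, summed sales in the cells.
# Combinations that never occur in the data show up as NaN.
pivot = df.pivot_table(values="Total Sales", index="Branch",
                       columns="Payment Method", aggfunc="sum")
print(pivot)
```
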

5. Visualization (Optional)

Use Matplotlib or Seaborn:

● Matplotlib
○ plt.plot() – Plot data.
○ plt.bar() – Create bar plots.
○ plt.pie() – Create pie charts.
○ plt.show() – Display the plot.

● Seaborn
○ sns.barplot(x, y) – Create bar plots.
○ sns.histplot(x) – Create histograms.
○ sns.heatmap() – Display heatmaps.
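
A bar plot with plain Matplotlib could look roughly like this. The sketch makes a few assumptions: the totals are invented, the non-interactive "Agg" backend is selected so the script also runs without a display, and the figure is saved to a made-up file name instead of being shown.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented per-branch totals.
sales = pd.Series({"A": 50.0, "B": 50.0, "C": 90.0}, name="Total Sales")

fig, ax = plt.subplots()
bars = ax.bar(sales.index, sales.values)  # one bar per branch
ax.set_title("Total Sales by Branch")
ax.set_xlabel("Branch")
ax.set_ylabel("Total Sales")

# Save to an image file (the file name is an assumption).
fig.savefig("total_sales_by_branch.png")
```

In an interactive session you would typically leave out the matplotlib.use("Agg") line and call plt.show() instead of saving.
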

6. Saving the Results

● df.to_csv('filename.csv', index=False) – Save the DataFrame to a CSV file.
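
To check that a saved file reads back as expected, a quick round trip like the one below can help. The data, the file name branch_totals.csv, and the temporary directory are all assumptions chosen so the sketch leaves no files behind.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Branch": ["A", "B"], "Total Sales": [50.0, 90.0]})

# A temporary directory keeps the sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "branch_totals.csv")
    df.to_csv(path, index=False)  # index=False leaves the row labels out of the file
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded)
```

Without index=False, the row labels would be written as an extra unnamed column and show up again when the file is read back.
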

7. General Python Functions

● len(df) – Get the number of rows in the DataFrame.
● set() – Get unique values (alternative to .unique()).
● round(number, decimals) – Round values to a specified number of decimals.
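
These built-ins work on Pandas objects as you would expect. A short sketch on invented values:

```python
import pandas as pd

df = pd.DataFrame({"Branch": ["A", "B", "A"], "Rating": [7.4, 8.0, 6.3]})

n_rows = len(df)                      # number of rows
branches = set(df["Branch"])          # unique values as a Python set (unordered)
avg = round(df["Rating"].mean(), 2)   # mean rating rounded to two decimals

print(n_rows, branches, avg)
```

Unlike .unique(), set() does not keep the order in which the values first appear.
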
