Divyanshi 05401172023 Ds Practical

Uploaded by

diviyanshimehra

import pandas as pd

import matplotlib.pyplot as plt


import seaborn as sns

# Load the dataset

data = pd.read_csv('Walmart Sales data.csv.csv')  # Replace with your actual file path

# Display the first few rows of the dataset to understand its structure
print(data.head())

    Invoice ID Branch       City Customer type  Gender  \
0  750-67-8428      A     Yangon        Member  Female
1  226-31-3081      C  Naypyitaw        Normal  Female
2  631-41-3108      A     Yangon        Normal    Male
3  123-19-1176      A     Yangon        Member    Male
4  373-73-7910      A     Yangon        Normal    Male

             Product line  Unit price  Quantity   Tax 5%     Total  \
0       Health and beauty       74.69         7  26.1415  548.9715
1  Electronic accessories       15.28         5   3.8200   80.2200
2      Home and lifestyle       46.33         7  16.2155  340.5255
3       Health and beauty       58.22         8  23.2880  489.0480
4       Sports and travel       86.31         7  30.2085  634.3785

         Date      Time      Payment    cogs  gross margin percentage  \
0  2019-01-05  13:08:00      Ewallet  522.83                 4.761905
1  2019-03-08  10:29:00         Cash   76.40                 4.761905
2  2019-03-03  13:23:00  Credit card  324.31                 4.761905
3  2019-01-27  20:33:00      Ewallet  465.76                 4.761905
4  2019-02-08  10:37:00      Ewallet  604.17                 4.761905

   gross income  Rating
0       26.1415     9.1
1        3.8200     9.6
2       16.2155     7.4
3       23.2880     8.4
4       30.2085     5.3

Q1. How many distinct cities are present in the dataset?
distinct_cities = data['City'].nunique()
print("Number of distinct cities:", distinct_cities)

Number of distinct cities: 3

Q2. In which city is each branch situated?


branch_city_mapping = data.groupby('Branch')['City'].unique()
print("Branches and their respective cities:")
for branch, city in branch_city_mapping.items():
    print("Branch:", branch, "-> City:", city)

Branches and their respective cities:


Branch: A -> City: ['Yangon']
Branch: B -> City: ['Mandalay']
Branch: C -> City: ['Naypyitaw']
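
The grouped result above holds one array of cities per branch. When each branch sits in exactly one city, a plain dictionary can be easier to reuse. A minimal sketch under that assumption; the `toy` frame and the name `branch_to_city` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Branch": ["A", "C", "A", "B"],
    "City": ["Yangon", "Naypyitaw", "Yangon", "Mandalay"],
})

# Keep one row per (Branch, City) pair, then index by branch.
# Assumption: every branch maps to a single city.
branch_to_city = (
    toy.drop_duplicates(subset=["Branch", "City"])
       .set_index("Branch")["City"]
       .to_dict()
)
print(branch_to_city)
```

If a branch could appear in more than one city, the dictionary would silently keep only the last pair, so the `unique()` form is the safer check.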

Visualizations

Distribution of sales across branches


plt.figure(figsize=(10, 6))
sns.countplot(x='Branch', data=data, palette='Set2')
plt.title('Distribution of Sales Across Branches')
plt.xlabel('Branch')
plt.ylabel('Number of Sales')
plt.show()
Distribution of customer types
plt.figure(figsize=(10, 6))
sns.countplot(x='Customer type', data=data, palette='Pastel1')
plt.title('Distribution of Customer Types')
plt.xlabel('Customer Type')
plt.ylabel('Number of Customers')
plt.show()
Gender distribution
plt.figure(figsize=(10, 6))
sns.countplot(x='Gender', data=data, palette='Dark2')
plt.title('Gender Distribution of Customers')
plt.xlabel('Gender')
plt.ylabel('Number of Customers')
plt.show()
Product line distribution
plt.figure(figsize=(12, 6))
sns.countplot(y='Product line', data=data, palette='Set3')
plt.title('Distribution of Product Lines')
plt.xlabel('Number of Sales')
plt.ylabel('Product Line')
plt.show()
Distribution of ratings
plt.figure(figsize=(10, 6))
sns.histplot(data['Rating'], bins=10, kde=True, color='skyblue')
plt.title('Distribution of Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
3. How many distinct product lines are there in the dataset?
distinct_product_lines = data['Product line'].nunique()
print("Number of distinct product lines:", distinct_product_lines)

Number of distinct product lines: 6

4. What is the most common payment method?


most_common_payment_method = data['Payment'].mode()[0]
print("Most common payment method:", most_common_payment_method)

Most common payment method: Ewallet

5. What is the most selling product line?


most_selling_product_line = data['Product line'].value_counts().idxmax()
print("Most selling product line:", most_selling_product_line)

Most selling product line: Fashion accessories
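
`value_counts()` counts transactions, not units. If "most selling" is read as the most units sold, the `Quantity` column would be summed instead. A small sketch contrasting the two readings; the `toy` frame is hypothetical, and the result on the real data is not claimed here.

```python
import pandas as pd

# Hypothetical toy rows: many small sales versus one large sale
toy = pd.DataFrame({
    "Product line": ["Fashion accessories", "Fashion accessories",
                     "Fashion accessories", "Electronic accessories"],
    "Quantity": [1, 1, 1, 10],
})

# Reading 1: product line with the most transactions (rows)
by_transactions = toy["Product line"].value_counts().idxmax()

# Reading 2: product line with the most units sold
by_units = toy.groupby("Product line")["Quantity"].sum().idxmax()

print(by_transactions, "|", by_units)
```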

6. What is the total revenue by month?


# Convert the 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'])

# Extract the month from the 'Date' column
data['month'] = data['Date'].dt.month

# Calculate total revenue by month


total_revenue_by_month = data.groupby('month')['Total'].sum()
print("Total revenue by month:")
print(total_revenue_by_month)

Total revenue by month:


month
1 116291.868
2 97219.374
3 109455.507
Name: Total, dtype: float64

7. Which month recorded the highest Cost of Goods Sold (COGS)?
# Convert the 'Date' column to datetime format
data['Date'] = pd.to_datetime(data['Date'])

# Ensure the column has been converted successfully
print(data['Date'].dtype)

# Calculate total revenue by month
total_revenue_by_month = data.groupby(data['Date'].dt.month)['Total'].sum()
print("Total revenue by month:")
print(total_revenue_by_month)

# Find the month with the highest Cost of Goods Sold (COGS)
highest_cogs_month = data.groupby(data['Date'].dt.month)['cogs'].sum().idxmax()
print("Month with the highest Cost of Goods Sold (COGS):",
      highest_cogs_month)
datetime64[ns]
Total revenue by month:
Date
1 116291.868
2 97219.374
3 109455.507
Name: Total, dtype: float64
Month with the highest Cost of Goods Sold (COGS): 1

8. Which product line generated the highest revenue?
highest_revenue_product_line = data.groupby('Product line')['Total'].sum().idxmax()
print("Product line with the highest revenue:",
      highest_revenue_product_line)

Product line with the highest revenue: Food and beverages

9. Which city has the highest revenue?


highest_revenue_city = data.groupby('City')['Total'].sum().idxmax()
print("City with the highest revenue:", highest_revenue_city)

City with the highest revenue: Naypyitaw

10. Which product line incurred the highest VAT?
# 10. Which product line incurred the highest VAT?
highest_vat_product_line = data.groupby('Product line')['Tax 5%'].sum().idxmax()
print("Product line with the highest VAT:", highest_vat_product_line)

Product line with the highest VAT: Food and beverages


11. Retrieve each product line and add a column product_category, indicating 'Good' or 'Bad,' based on whether its sales are above the average.
# 11. Retrieve each product line and add a column product_category,
# indicating 'Good' or 'Bad', based on whether its sales are above the average.

# Average quantity per transaction, used as the threshold
average_quantity_sold = data['Quantity'].mean()

# Function to categorize sales
def categorize_sales(quantity):
    if quantity > average_quantity_sold:
        return 'Good'
    else:
        return 'Bad'

# Apply the function to create the product category column
data['product_category'] = data['Quantity'].apply(categorize_sales)

# Display the updated DataFrame with the new column
print(data[['Product line', 'Quantity', 'product_category']].head())

Product line Quantity product_category


0 Health and beauty 7 Good
1 Electronic accessories 5 Bad
2 Home and lifestyle 7 Good
3 Health and beauty 8 Good
4 Sports and travel 7 Good
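
The labels above are assigned per transaction. The question can also be read as asking for one label per product line. A minimal sketch of that reading, assuming "sales" means total units and the threshold is the mean of the per-line totals; `toy`, `line_totals`, and `threshold` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Product line": ["Health and beauty", "Health and beauty",
                     "Electronic accessories", "Sports and travel"],
    "Quantity": [7, 8, 5, 7],
})

# One row per product line with its total units sold
line_totals = (
    toy.groupby("Product line")["Quantity"].sum()
       .rename("total_quantity")
       .reset_index()
)

# Assumed threshold: the average of the per-line totals
threshold = line_totals["total_quantity"].mean()
line_totals["product_category"] = line_totals["total_quantity"].apply(
    lambda q: "Good" if q > threshold else "Bad"
)
print(line_totals)
```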

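12. Which branch sold more products than average product sold?
# 12. Which branch sold more products than average product sold?
# `average_sales` was not defined in the original. It is assumed here to be the
# mean quantity per transaction, which is consistent with the output shown below.
average_sales = data['Quantity'].mean()
branch_product_counts = data.groupby('Branch')['Quantity'].sum()
branch_more_than_average = branch_product_counts[branch_product_counts > average_sales].index.tolist()
print("Branch(es) with more products sold than the average:",
      branch_more_than_average)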
Branch(es) with more products sold than the average: ['A', 'B', 'C']
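
Every branch appears in the output above, which suggests the threshold was a per-transaction average rather than a per-branch one. A sketch of the alternative reading, where each branch total is compared against the mean of the branch totals; the `toy` frame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "C", "C"],
    "Quantity": [7, 8, 5, 2, 6, 4],
})

# Total units per branch, and the mean of those totals
branch_totals = toy.groupby("Branch")["Quantity"].sum()
mean_branch_total = branch_totals.mean()

# Branches whose total exceeds the average branch total
above_average = branch_totals[branch_totals > mean_branch_total].index.tolist()
print(above_average)
```

With like-for-like units on both sides of the comparison, at least one branch must fall at or below the average, so the filter becomes informative.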

13. What is the most common product line by gender?
# 13. What is the most common product line by gender?
common_product_line_by_gender = data.groupby(['Gender', 'Product line']).size().idxmax()
print("Most common product line by gender:",
      common_product_line_by_gender[1])

Most common product line by gender: Fashion accessories
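
`idxmax()` on the combined grouping returns only the single largest (gender, product line) pair. To get one answer per gender, the top product line can be picked inside each gender group. A minimal sketch on hypothetical toy rows; `top_by_gender` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Product line": ["Fashion accessories", "Fashion accessories",
                     "Food and beverages", "Health and beauty",
                     "Health and beauty", "Food and beverages"],
})

# For each gender, take the product line that occurs most often
top_by_gender = toy.groupby("Gender")["Product line"].agg(
    lambda s: s.value_counts().idxmax()
)
print(top_by_gender)
```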

14. What is the average rating of each product line?
# 14. What is the average rating of each product line?
average_rating_by_product_line = data.groupby('Product line')['Rating'].mean()
print("Average rating of each product line:")
print(average_rating_by_product_line)

Average rating of each product line:


Product line
Electronic accessories 6.924706
Fashion accessories 7.029213
Food and beverages 7.113218
Health and beauty 7.003289
Home and lifestyle 6.837500
Sports and travel 6.916265
Name: Rating, dtype: float64

15. Number of sales made in each time of the day per weekday
# 15. Number of sales made in each time of the day per weekday
data['weekday'] = data['Date'].dt.weekday
sales_per_time_per_weekday = data.groupby(['weekday', 'Time']).size()
print("Number of sales made in each time of the day per weekday:")
print(sales_per_time_per_weekday)

Number of sales made in each time of the day per weekday:


weekday Time
0 10:00:00 1
10:02:00 1
10:05:00 1
10:11:00 1
10:23:00 2
..
6 20:33:00 1
20:37:00 1
20:38:00 1
20:46:00 1
20:51:00 1
Length: 914, dtype: int64
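
Grouping on the exact `Time` string yields one row per distinct timestamp (914 rows above). If "time of the day" means a coarse period, the times can be bucketed first. A sketch under assumed cut-offs (before 12:00 is Morning, before 17:00 is Afternoon, otherwise Evening); the cut-offs, the `time_of_day` helper, and the `toy` frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-05", "2019-01-05",
                            "2019-01-07", "2019-01-07"]),
    "Time": ["10:29:00", "13:08:00", "13:23:00", "20:33:00"],
})

def time_of_day(time_str):
    """Map an 'HH:MM:SS' string to an assumed period of the day."""
    hour = int(time_str.split(":")[0])
    if hour < 12:
        return "Morning"
    if hour < 17:
        return "Afternoon"
    return "Evening"

toy["time_of_day"] = toy["Time"].apply(time_of_day)
toy["day_name"] = toy["Date"].dt.day_name()

# Number of sales per weekday and period
sales_per_period = toy.groupby(["day_name", "time_of_day"]).size()
print(sales_per_period)
```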

16. Identify the customer type that generates the highest revenue.
# 16. Identify the customer type that generates the highest revenue.
highest_revenue_customer_type = data.groupby('Customer type')['Total'].sum().idxmax()
print("Customer type that generates the highest revenue:",
      highest_revenue_customer_type)

Customer type that generates the highest revenue: Member

17. Which city has the largest tax percent/ VAT (Value Added Tax)?
# 17. Which city has the largest tax percent/ VAT (Value Added Tax)?
# Note: the tax rate is a flat 5% on every row, so this compares the mean tax amount per city.
city_with_largest_vat_percent = data.groupby('City')['Tax 5%'].mean().idxmax()
print("City with the largest tax percent/ VAT:",
      city_with_largest_vat_percent)

City with the largest tax percent/ VAT: Naypyitaw


18. Which customer type pays the most VAT?
# 18. Which customer type pays the most VAT?
customer_type_with_most_vat = data.groupby('Customer type')['Tax 5%'].sum().idxmax()
print("Customer type that pays the most VAT:",
      customer_type_with_most_vat)

Customer type that pays the most VAT: Member

19. How many unique customer types does the data have?
# 19. How many unique customer types does the data have?
unique_customer_types = data['Customer type'].nunique()
print("Number of unique customer types:", unique_customer_types)

Number of unique customer types: 2

20. How many unique payment methods does the data have?
# 20. How many unique payment methods does the data have?
unique_payment_methods = data['Payment'].nunique()
print("Number of unique payment methods:", unique_payment_methods)

Number of unique payment methods: 3

21. Which is the most common customer type?


most_common_customer_type = data['Customer type'].mode()[0]
print("Most common customer type:", most_common_customer_type)

Most common customer type: Member

22. Which customer type buys the most?


most_buying_customer_type = data.groupby('Customer type')['Quantity'].sum().idxmax()
print("Customer type that buys the most:", most_buying_customer_type)

Customer type that buys the most: Member

23. What is the gender of most of the customers?
most_common_gender = data['Gender'].mode()[0]
print("Gender of most of the customers:", most_common_gender)

Gender of most of the customers: Female

24. What is the gender distribution per branch?


gender_distribution_per_branch = data.groupby(['Branch',
'Gender']).size()
print("Gender distribution per branch:")
print(gender_distribution_per_branch)

Gender distribution per branch:


Branch Gender
A Female 161
Male 179
B Female 162
Male 170
C Female 178
Male 150
dtype: int64

25. Which time of the day do customers give most ratings?
most_rated_time_of_day = data.groupby('Time')['Rating'].sum().idxmax()
print("Time of the day when customers give the most ratings:",
most_rated_time_of_day)

Time of the day when customers give the most ratings: 19:48:00
26. Which time of the day do customers give most ratings per branch?
most_rated_time_of_day_per_branch = data.groupby(['Branch', 'Time'])['Rating'].sum().idxmax()
print("Time of the day when customers give the most ratings per branch:",
      most_rated_time_of_day_per_branch)

Time of the day when customers give the most ratings per branch: ('C', '10:23:00')
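
The result above is a single (branch, time) pair, because `idxmax()` runs over the whole grouped series. To get one answer per branch, the maximum can be taken within each branch. A sketch that also assumes "most ratings" means the number of ratings per hour rather than the sum of rating values per exact timestamp; the `toy` frame, the `Hour` column, and `busiest_hour` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Branch": ["A", "A", "A", "B", "B", "B"],
    "Time": ["10:05:00", "10:40:00", "19:48:00",
             "13:08:00", "15:10:00", "15:55:00"],
    "Rating": [9.1, 7.4, 8.4, 5.3, 6.0, 7.2],
})

# Bucket each 'HH:MM:SS' string by its hour
toy["Hour"] = toy["Time"].str.split(":").str[0].astype(int)

# Count ratings per (branch, hour)
ratings_per_hour = toy.groupby(["Branch", "Hour"])["Rating"].count()

# Rows are branches, columns are hours; idxmax along columns gives
# the hour with the most ratings for each branch
busiest_hour = ratings_per_hour.unstack(fill_value=0).idxmax(axis=1)
print(busiest_hour)
```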

27. Which day of the week has the best avg ratings?
best_avg_ratings_day_of_week = data.groupby(data['Date'].dt.dayofweek)['Rating'].mean().idxmax()
print("Day of the week with the best average ratings:",
      best_avg_ratings_day_of_week)

Day of the week with the best average ratings: 0
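
In pandas, `dt.dayofweek` numbers the days from Monday = 0 to Sunday = 6, so the `0` above corresponds to Monday. Grouping on `dt.day_name()` reports the name directly. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows standing in for the real `data` frame
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-07", "2019-01-07",
                            "2019-01-08", "2019-01-09"]),
    "Rating": [9.0, 8.0, 7.0, 6.5],
})

# Average rating per named weekday, then the best one
avg_by_day = toy.groupby(toy["Date"].dt.day_name())["Rating"].mean()
best_day = avg_by_day.idxmax()
print(best_day)
```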

28. Which day of the week has the best average ratings per branch?
best_avg_ratings_day_of_week_per_branch = data.groupby(['Branch',
data['Date'].dt.dayofweek])['Rating'].mean().idxmax()
print("Day of the week with the best average ratings per branch:",
best_avg_ratings_day_of_week_per_branch)

Day of the week with the best average ratings per branch: ('B', 0)

29. Are there any patterns or trends in sales over time (by month, day of the week, or time of day)?
# For example, we can visualize total sales over time
import matplotlib.pyplot as plt

# Extract month, day of the week, and hour from the date
data['Month'] = data['Date'].dt.month
data['DayOfWeek'] = data['Date'].dt.dayofweek
data['Hour'] = data['Time'].apply(lambda x: int(x.split(':')[0]))

# Total sales by month


total_sales_by_month = data.groupby('Month')['Total'].sum()

# Total sales by day of the week


total_sales_by_day_of_week = data.groupby('DayOfWeek')['Total'].sum()

# Total sales by hour of the day


total_sales_by_hour = data.groupby('Hour')['Total'].sum()

# Plotting
plt.figure(figsize=(18, 5))

plt.subplot(1, 3, 1)
plt.plot(total_sales_by_month, marker='o')
plt.title('Total Sales by Month')
plt.xlabel('Month')
plt.ylabel('Total Sales')

plt.subplot(1, 3, 2)
plt.plot(total_sales_by_day_of_week, marker='o')
plt.title('Total Sales by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Total Sales')

plt.subplot(1, 3, 3)
plt.plot(total_sales_by_hour, marker='o')
plt.title('Total Sales by Hour of the Day')
plt.xlabel('Hour of the Day')
plt.ylabel('Total Sales')

plt.tight_layout()
plt.show()
30. Are there any differences in customer ratings between branches?
ratings_by_branch = data.groupby('Branch')['Rating'].mean()
print("Average ratings by branch:")
print(ratings_by_branch)

Average ratings by branch:


Branch
A 7.027059
B 6.818072
C 7.072866
Name: Rating, dtype: float64

31. Is there any correlation between the tax amount and the total transaction amount?
correlation_tax_total = data['Tax 5%'].corr(data['Total'])
print("Correlation between tax amount and total transaction amount:",
correlation_tax_total)

Correlation between tax amount and total transaction amount: 0.9999999999999998
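
A correlation this close to 1 follows from how the columns are built rather than from customer behaviour. In the rows printed by `data.head()`, the tax is 5% of `cogs` and `Total` is `cogs` plus tax, so Tax = 0.05 × cogs and Total = 1.05 × cogs, which gives Tax = Total / 21. The sketch below checks this on the five rows shown at the top of the notebook; whether it holds for every row of the file is an assumption.

```python
import pandas as pd

# The five rows printed by data.head() earlier in the notebook
head = pd.DataFrame({
    "cogs": [522.83, 76.40, 324.31, 465.76, 604.17],
    "Tax 5%": [26.1415, 3.8200, 16.2155, 23.2880, 30.2085],
    "Total": [548.9715, 80.2200, 340.5255, 489.0480, 634.3785],
})

# Tax as a fraction of the total: expected to be 1/21 on every row
ratio = head["Tax 5%"] / head["Total"]

# Pearson correlation of two columns related by a constant factor
r = head["Tax 5%"].corr(head["Total"])
print(ratio.round(6).tolist(), r)
```

The same factor explains the constant `gross margin percentage` of 4.761905, which is 100 / 21.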

32. Do certain product lines tend to have higher ratings than others?
ratings_by_product_line = data.groupby('Product line')['Rating'].mean()
print("Average ratings by product line:")
print(ratings_by_product_line)

Average ratings by product line:


Product line
Electronic accessories 6.924706
Fashion accessories 7.029213
Food and beverages 7.113218
Health and beauty 7.003289
Home and lifestyle 6.837500
Sports and travel 6.916265
Name: Rating, dtype: float64

33. Is there any correlation between the quantity of items purchased and the total transaction amount?
correlation_quantity_total = data['Quantity'].corr(data['Total'])
print("Correlation between quantity and total transaction amount:",
correlation_quantity_total)
