
Sales Analysis Project

The document outlines a sales analysis process involving the importation of sales data from multiple CSV files, merging them into a single dataset, and performing data cleaning. It includes steps for adding new columns such as 'Month' and 'Sales', and visualizing sales data by month and city. Additionally, it discusses optimizing advertisement timing based on order timestamps.

Uploaded by

Hend Selmy

sales-analysis

May 3, 2025

0.0.1 Import Necessary Libraries

[1]: import pandas as pd


import os
import matplotlib.pyplot as plt

0.0.2 Merge 12 Months of Sales Data into a Single CSV File

[2]: files = [file for file in os.listdir("./Sales_Data")]

for f in files:
    print(f)

Sales_April_2019.csv
Sales_August_2019.csv
Sales_December_2019.csv
Sales_February_2019.csv
Sales_January_2019.csv
Sales_July_2019.csv
Sales_June_2019.csv
Sales_March_2019.csv
Sales_May_2019.csv
Sales_November_2019.csv
Sales_October_2019.csv
Sales_September_2019.csv

[3]: df_list = []

for file_name in files:
    # os.path.join combines multiple parts of a file path
    file_path = os.path.join("./Sales_Data", file_name)
    df = pd.read_csv(file_path)
    df_list.append(df)

# concatenate all dataframes
combined_df = pd.concat(df_list, ignore_index=True)

# save the combined data to a single file
combined_df.to_csv("all_data.csv", index=False)
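
The merge step can also be written as a small function that picks up only `.csv` files and records which file each row came from. This is a sketch under assumptions: the name `merge_sales_files` and the `Source File` column are illustrative and do not appear in the original notebook.

```python
import glob
import os

import pandas as pd


def merge_sales_files(folder):
    """Concatenate every CSV file in `folder` into one DataFrame."""
    frames = []
    # sorted() makes the row order reproducible across operating systems
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        frame = pd.read_csv(path)
        # Hypothetical extra column: remember the originating file
        frame["Source File"] = os.path.basename(path)
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no CSV files found in {folder!r}")
    return pd.concat(frames, ignore_index=True)
```

Calling `merge_sales_files("./Sales_Data")` would then return the same rows as `combined_df`, plus the extra column, which can help trace a bad row back to its monthly file.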

[4]: all_data = pd.read_csv("all_data.csv")
all_data.head()

[4]: Order ID Product Quantity Ordered Price Each \


0 176558 USB-C Charging Cable 2 11.95
1 NaN NaN NaN NaN
2 176559 Bose SoundSport Headphones 1 99.99
3 176560 Google Phone 1 600
4 176560 Wired Headphones 1 11.99

Order Date Purchase Address


0 04/19/19 08:46 917 1st St, Dallas, TX 75001
1 NaN NaN
2 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

0.0.3 Clean up the Data

[5]: all_data.isnull().sum()

[5]: Order ID 545


Product 545
Quantity Ordered 545
Price Each 545
Order Date 545
Purchase Address 545
dtype: int64

[6]: all_data = all_data.dropna(how="all")

[7]: all_data.isnull().sum()

[7]: Order ID 0
Product 0
Quantity Ordered 0
Price Each 0
Order Date 0
Purchase Address 0
dtype: int64

Find rows whose Order Date starts with "Or" and delete them


[8]: all_data.columns

[8]: Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
'Purchase Address'],
dtype='object')

[9]: all_data = all_data[all_data["Order Date"].str[0:2] != "Or"]

Convert Column to the Correct Type


[10]: all_data["Quantity Ordered"] = pd.to_numeric(all_data["Quantity Ordered"])  # make int
all_data["Price Each"] = pd.to_numeric(all_data["Price Each"])  # make float
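
The three cleaning steps (drop empty rows, drop rows whose Order Date starts with "Or", convert types) can be checked on a few toy rows. This sketch assumes the "Or" rows are header lines repeated inside the data, which the notebook does not state explicitly; it compares against the full header text rather than the first two characters, and the `raw` and `clean` names are illustrative.

```python
import pandas as pd

# Toy rows mimicking the merged file: one blank row and one repeated header row
raw = pd.DataFrame(
    {
        "Order ID": ["176558", None, "Order ID", "176559"],
        "Quantity Ordered": ["2", None, "Quantity Ordered", "1"],
        "Price Each": ["11.95", None, "Price Each", "99.99"],
        "Order Date": ["04/19/19 08:46", None, "Order Date", "04/07/19 22:30"],
    }
)

# Drop rows where every column is empty
clean = raw.dropna(how="all")
# Drop the repeated header row; .copy() avoids chained-assignment warnings below
clean = clean[clean["Order Date"] != "Order Date"].copy()
# Convert the text columns to numbers now that only real orders remain
clean["Quantity Ordered"] = pd.to_numeric(clean["Quantity Ordered"])
clean["Price Each"] = pd.to_numeric(clean["Price Each"])
```

The order matters here: `pd.to_numeric` would raise on the text "Quantity Ordered" if the header rows were still present.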

Add Month Column


[11]: all_data["Month"] = all_data["Order Date"].str[0:2]
all_data["Month"] = all_data["Month"].astype("int32")
all_data.head()

[11]: Order ID Product Quantity Ordered Price Each \


0 176558 USB-C Charging Cable 2 11.95
2 176559 Bose SoundSport Headphones 1 99.99
3 176560 Google Phone 1 600.00
4 176560 Wired Headphones 1 11.99
5 176561 Wired Headphones 1 11.99

Order Date Purchase Address Month


0 04/19/19 08:46 917 1st St, Dallas, TX 75001 4
2 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4
3 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
4 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
5 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4

Add Sales Column


[12]: all_data["Sales"] = all_data["Quantity Ordered"] * all_data["Price Each"]
all_data

[12]: Order ID Product Quantity Ordered Price Each \


0 176558 USB-C Charging Cable 2 11.95
2 176559 Bose SoundSport Headphones 1 99.99
3 176560 Google Phone 1 600.00
4 176560 Wired Headphones 1 11.99
5 176561 Wired Headphones 1 11.99
… … … … …
186845 259353 AAA Batteries (4-pack) 3 2.99
186846 259354 iPhone 1 700.00
186847 259355 iPhone 1 700.00
186848 259356 34in Ultrawide Monitor 1 379.99
186849 259357 USB-C Charging Cable 1 11.95

Order Date Purchase Address Month Sales
0 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90
2 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99
3 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99
… … … … …
186845 09/17/19 20:56 840 Highland St, Los Angeles, CA 90001 9 8.97
186846 09/01/19 16:00 216 Dogwood St, San Francisco, CA 94016 9 700.00
186847 09/23/19 07:39 220 12th St, San Francisco, CA 94016 9 700.00
186848 09/19/19 17:30 511 Forest St, San Francisco, CA 94016 9 379.99
186849 09/30/19 00:18 250 Meadow St, San Francisco, CA 94016 9 11.95

[185950 rows x 8 columns]

0.0.4 Which is the Best Month for Sales?


[13]: results = all_data.groupby("Month").sum(numeric_only=True)  # sum only the numeric columns

[14]: plt.figure(figsize=(6,4))
months = range(1,13)
plt.bar(months, results["Sales"])
plt.xticks(months)
plt.xlabel("Month Number")
plt.ylabel("Sales in USD ($)")
plt.show()
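
Besides reading the answer off the bar chart, the best month can be extracted directly from the grouped totals. A minimal sketch, using three rows taken from the output shown earlier (indices 0, 2 and 186845); the names `orders`, `monthly_sales` and `best_month` are illustrative.

```python
import pandas as pd

# Three rows copied from the notebook output above
orders = pd.DataFrame(
    {
        "Month": [4, 4, 9],
        "Quantity Ordered": [2, 1, 3],
        "Price Each": [11.95, 99.99, 2.99],
    }
)
orders["Sales"] = orders["Quantity Ordered"] * orders["Price Each"]

# Total sales per month, then the month number with the largest total
monthly_sales = orders.groupby("Month")["Sales"].sum()
best_month = monthly_sales.idxmax()
```

On the full dataset the same two lines would be applied to `all_data` instead of the toy frame.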

Add City Column
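[15]: def get_city(address):
    # "917 1st St, Dallas, TX 75001" -> "Dallas" (strip removes the space after the comma)
    return address.split(",")[1].strip()

def get_state(address):
    # "917 1st St, Dallas, TX 75001" -> "TX"
    return address.split(",")[2].split(" ")[1]

all_data["City"] = all_data["Purchase Address"].apply(lambda x: f"{get_city(x)} ({get_state(x)})")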

all_data

[15]: Order ID Product Quantity Ordered Price Each \


0 176558 USB-C Charging Cable 2 11.95
2 176559 Bose SoundSport Headphones 1 99.99
3 176560 Google Phone 1 600.00
4 176560 Wired Headphones 1 11.99
5 176561 Wired Headphones 1 11.99
… … … … …
186845 259353 AAA Batteries (4-pack) 3 2.99
186846 259354 iPhone 1 700.00
186847 259355 iPhone 1 700.00
186848 259356 34in Ultrawide Monitor 1 379.99
186849 259357 USB-C Charging Cable 1 11.95

Order Date Purchase Address Month \
0 04/19/19 08:46 917 1st St, Dallas, TX 75001 4
2 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4
3 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
4 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
5 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4
… … … …
186845 09/17/19 20:56 840 Highland St, Los Angeles, CA 90001 9
186846 09/01/19 16:00 216 Dogwood St, San Francisco, CA 94016 9
186847 09/23/19 07:39 220 12th St, San Francisco, CA 94016 9
186848 09/19/19 17:30 511 Forest St, San Francisco, CA 94016 9
186849 09/30/19 00:18 250 Meadow St, San Francisco, CA 94016 9

Sales City
0 23.90 Dallas (TX)
2 99.99 Boston (MA)
3 600.00 Los Angeles (CA)
4 11.99 Los Angeles (CA)
5 11.99 Los Angeles (CA)
… … …
186845 8.97 Los Angeles (CA)
186846 700.00 San Francisco (CA)
186847 700.00 San Francisco (CA)
186848 379.99 San Francisco (CA)
186849 11.95 San Francisco (CA)

[185950 rows x 9 columns]
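
The address parsing can be folded into a single helper that also rejects unexpected input. This is a sketch, assuming every address has the form "street, city, ST zip" as in the rows above; `city_with_state` is a hypothetical name, not part of the notebook. Appending the state presumably keeps same-named cities in different states apart.

```python
def city_with_state(address):
    """Return 'City (ST)' from an address of the form 'street, city, ST zip'."""
    parts = [part.strip() for part in address.split(",")]
    if len(parts) != 3:
        raise ValueError(f"unexpected address format: {address!r}")
    city = parts[1]
    # 'TX 75001' -> 'TX'
    state = parts[2].split()[0]
    return f"{city} ({state})"
```

It could be used as `all_data["Purchase Address"].apply(city_with_state)`.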

0.0.5 Which City had the Highest Sales?

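[16]: results = all_data.groupby("City").sum(numeric_only=True)

[17]: # Take the labels from the grouped result so they match the bar heights;
# unique() returns cities in order of first appearance, not in groupby's sorted order
cities = list(results.index)

plt.bar(cities, results["Sales"])
plt.xticks(cities, rotation=45, size=8)
plt.xlabel("City Name")
plt.ylabel("Sales in USD ($)")
plt.show()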
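
One thing to watch when plotting grouped results: `groupby` sorts its keys, while `unique()` keeps the order of first appearance, so labels taken from one and heights taken from the other can end up misaligned. A small sketch with toy data (the third row is made up for illustration):

```python
import pandas as pd

# "Dallas (TX)" appears first in the data but sorts after "Boston (MA)"
orders = pd.DataFrame(
    {
        "City": ["Dallas (TX)", "Boston (MA)", "Dallas (TX)"],
        "Sales": [23.90, 99.99, 11.99],
    }
)

city_sales = orders.groupby("City")["Sales"].sum()

# Order of first appearance in the data
appearance_order = list(orders["City"].unique())
# Sorted group keys, aligned with the values in city_sales
grouped_order = list(city_sales.index)
```

Here `appearance_order` starts with Dallas and `grouped_order` starts with Boston, so only the index of the grouped result is safe to use as bar labels.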

0.0.6 What time should we display advertisements to maximize the likelihood of customers buying the product?

[18]: # Passing an explicit format avoids the "Could not infer format" UserWarning
# and the slower element-by-element dateutil fallback
all_data["Order Date"] = pd.to_datetime(all_data["Order Date"], format="%m/%d/%y %H:%M")
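
Order timestamps in this dataset look like `04/19/19 08:46`, which corresponds to the format string `%m/%d/%y %H:%M`. The string can be checked with the standard library before handing it to `pd.to_datetime(..., format=...)`; the constant name `ORDER_DATE_FORMAT` is illustrative.

```python
from datetime import datetime

# month/day/two-digit year, then 24-hour time
ORDER_DATE_FORMAT = "%m/%d/%y %H:%M"

# First timestamp from the head() output above
parsed = datetime.strptime("04/19/19 08:46", ORDER_DATE_FORMAT)
```

`parsed` is 19 April 2019 at 08:46, matching the `2019-04-19 08:46:00` value pandas shows after conversion.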

[19]: all_data.head()

[19]: Order ID Product Quantity Ordered Price Each \
0 176558 USB-C Charging Cable 2 11.95
2 176559 Bose SoundSport Headphones 1 99.99
3 176560 Google Phone 1 600.00
4 176560 Wired Headphones 1 11.99
5 176561 Wired Headphones 1 11.99

Order Date Purchase Address Month Sales \


0 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90
2 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99
3 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99

City
0 Dallas (TX)
2 Boston (MA)
3 Los Angeles (CA)
4 Los Angeles (CA)
5 Los Angeles (CA)

[20]: all_data["Hour"] = all_data["Order Date"].dt.hour


all_data["Minute"] = all_data["Order Date"].dt.minute

[21]: all_data.head()

[21]: Order ID Product Quantity Ordered Price Each \


0 176558 USB-C Charging Cable 2 11.95
2 176559 Bose SoundSport Headphones 1 99.99
3 176560 Google Phone 1 600.00
4 176560 Wired Headphones 1 11.99
5 176561 Wired Headphones 1 11.99

Order Date Purchase Address Month Sales \


0 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90
2 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99
3 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99

City Hour Minute


0 Dallas (TX) 8 46
2 Boston (MA) 22 30
3 Los Angeles (CA) 14 38
4 Los Angeles (CA) 14 38
5 Los Angeles (CA) 9 27

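[22]: hours = sorted(all_data["Hour"].unique())
# Count a single column so one line is plotted instead of one line per column
plt.plot(hours, all_data.groupby("Hour")["Order ID"].count())
plt.xticks(hours)
plt.grid()
plt.xlabel("Hour")
plt.ylabel("Count of Orders")
plt.show()

# The highest numbers of orders were placed around 11 AM (hour 11) and 7 PM (hour 19).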
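
Counting rows per hour counts line items rather than distinct orders: in the output above, Order ID 176560 appears on two rows. A sketch of both counts on the five rows shown by `head()`; the names `rows_per_hour` and `orders_per_hour` are illustrative, and whether the difference matters for the full dataset is not something the notebook reports.

```python
import pandas as pd

# The five rows shown in the head() output above
orders = pd.DataFrame(
    {
        "Order ID": ["176558", "176559", "176560", "176560", "176561"],
        "Order Date": [
            "04/19/19 08:46",
            "04/07/19 22:30",
            "04/12/19 14:38",
            "04/12/19 14:38",
            "04/30/19 09:27",
        ],
    }
)
orders["Hour"] = pd.to_datetime(orders["Order Date"], format="%m/%d/%y %H:%M").dt.hour

# Number of rows (line items) placed in each hour
rows_per_hour = orders.groupby("Hour").size()
# Number of distinct orders placed in each hour
orders_per_hour = orders.groupby("Hour")["Order ID"].nunique()
```

For hour 14 the first count is 2 and the second is 1, because both rows belong to the same order.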

0.0.7 Which products sold the most? Why do you think they sold the most?

[23]: product_group = all_data.groupby("Product")


quantity_order = product_group["Quantity Ordered"].sum()
products = [product for product, f in product_group]
plt.bar(products,quantity_order)
plt.ylabel("Quantity Ordered")
plt.xlabel("Product")
plt.xticks(products, rotation="vertical", size=8)
plt.show()

[44]: price = all_data.groupby("Product")["Price Each"].mean()

fig, ax1 = plt.subplots()
# Second y axis sharing the same x axis, useful when two series have different scales
ax2 = ax1.twinx()

ax1.bar(products, quantity_order, color="green")
ax2.plot(products, price, color="blue")

ax1.set_xlabel("Product Name")
ax1.set_ylabel("Quantity Ordered", color="green")
ax2.set_ylabel("Average Price in USD ($)", color="blue")
# tick_params rotates the existing tick labels and avoids the set_ticklabels() UserWarning
ax1.tick_params(axis="x", labelrotation=90, labelsize=8)
plt.title("Quantity Ordered vs Average Price for Each Product")

plt.show()

# Lower-priced products tended to sell in larger quantities, while higher-priced products sold fewer units.
