Case Study: Analyze Sales: Clean Up Data
1. What was the best month for sales? How much was earned that month?
2. What city sold the most product?
3. What time should we display advertisements to maximize the likelihood of a customer’s buying a product?
4. What products are most often sold together?
5. What product sold the most? Why do you think it sold the most?
Import libraries
In [93]: import pandas as pd
import os
import matplotlib.pyplot as plt
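data_dir = "C:/Users/juand/OneDrive/Escritorio/Sales_Data/"
# Assumption: `files` is every CSV file in the data folder
files = [file for file in os.listdir(data_dir) if file.endswith(".csv")]
all_months_data = pd.DataFrame()
for file in files:
    df = pd.read_csv(data_dir + file)
    all_months_data = pd.concat([all_months_data, df])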
Out[95]: Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001
6 176562 USB-C Charging Cable 1 11.95 04/29/19 13:03 381 Wilson St, San Francisco, CA 94016
7 176563 Bose SoundSport Headphones 1 99.99 04/02/19 07:46 668 Center St, Seattle, WA 98101
8 176564 USB-C Charging Cable 1 11.95 04/12/19 10:58 790 Ridge St, Atlanta, GA 30301
9 176565 Macbook Pro Laptop 1 1700 04/24/19 10:38 915 Willow St, San Francisco, CA 94016
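The merge step above can be sketched in a self-contained form. This is only an illustration: the helper name `merge_csv_files`, the two file names, and the sample rows are assumptions made for the example, not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_csv_files(data_dir):
    """Read every .csv file in data_dir and stack them into one DataFrame."""
    csv_paths = sorted(Path(data_dir).glob("*.csv"))
    frames = [pd.read_csv(path) for path in csv_paths]
    # ignore_index=True gives the merged frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with two made-up monthly files (hypothetical names and rows).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Sales_April_2019.csv").write_text(
        "Order ID,Product\n176558,USB-C Charging Cable\n"
    )
    Path(tmp, "Sales_May_2019.csv").write_text(
        "Order ID,Product\n194095,Wired Headphones\n"
    )
    merged = merge_csv_files(tmp)

print(merged)
```

Collecting the frames in a list and calling `pd.concat` once avoids re-copying the growing DataFrame on every loop iteration.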
Clean up data
Drop rows of NaN
In [96]: nan_df = all_data[all_data.isna().any(axis=1)]
nan_df.head()
Out[96]: Order ID Product Quantity Ordered Price Each Order Date Purchase Address
Out[97]: Order ID Product Quantity Ordered Price Each Order Date Purchase Address
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001
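A minimal sketch of the NaN clean-up, assuming the empty rows are blank in every column (which would explain why index 1 is missing from the output above). The sample frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the middle row is entirely empty, as a blank CSV line would be.
sample = pd.DataFrame(
    {
        "Order ID": ["176558", np.nan, "176559"],
        "Product": ["USB-C Charging Cable", np.nan, "Bose SoundSport Headphones"],
        "Quantity Ordered": ["2", np.nan, "1"],
    }
)

# Rows with at least one missing value (same test as in the cell above).
nan_rows = sample[sample.isna().any(axis=1)]

# Drop only rows where every column is missing; partially filled rows are kept.
cleaned = sample.dropna(how="all")

print(len(nan_rows), "row(s) with NaN;", len(cleaned), "row(s) kept")
```

`dropna(how="all")` is the conservative choice here; `how="any"` would also discard rows that are only partly empty.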
Question 1. What was the best month for sales? How much was earned that month?
Add a new column with the month number.
Out[100… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4
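One way the Month column could be derived is sketched below. It assumes Order Date is text in the form "04/19/19 08:46"; the second date in the sample is made up for illustration.

```python
import pandas as pd

# First value is from the table above; the second is a hypothetical December order.
sample = pd.DataFrame({"Order Date": ["04/19/19 08:46", "12/07/19 22:30"]})

# Parse the text into datetimes; the format string matches values like "04/19/19 08:46".
order_dates = pd.to_datetime(sample["Order Date"], format="%m/%d/%y %H:%M")
sample["Month"] = order_dates.dt.month

print(sample)
```

Parsing the date is more robust than slicing the first two characters of the string, because it fails loudly on malformed values.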
Out[101… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99
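The Sales column and the monthly totals behind Question 1 can be sketched as follows. The rows are hypothetical, and the conversion with `pd.to_numeric` is an assumption for the case where the CSV columns are read as text.

```python
import pandas as pd

# Hypothetical rows; CSV columns often arrive as text, so convert them first.
sample = pd.DataFrame(
    {
        "Month": [4, 4, 12],
        "Quantity Ordered": ["2", "1", "1"],
        "Price Each": ["11.95", "99.99", "600"],
    }
)
sample["Quantity Ordered"] = pd.to_numeric(sample["Quantity Ordered"])
sample["Price Each"] = pd.to_numeric(sample["Price Each"])

# Revenue per order line.
sample["Sales"] = sample["Quantity Ordered"] * sample["Price Each"]

# Total revenue per month, and the month with the highest total.
monthly_sales = sample.groupby("Month")["Sales"].sum()
best_month = monthly_sales.idxmax()

print(monthly_sales)
print("Best month:", best_month)
```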
Answer 1:
The best month for sales was December.
That month earned at least 4 million dollars.
Question 2. What city sold the most product?
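def get_city(address):
    # Assumed helper, not shown in the source: "917 1st St, Dallas, TX 75001" -> "Dallas"
    return address.split(',')[1].strip()
def get_state(address):
    # "917 1st St, Dallas, TX 75001" -> "TX"
    return address.split(',')[2].split(" ")[1]
all_data['City'] = all_data['Purchase Address'].apply(lambda x: f'{get_city(x)} ({get_state(x)})')
all_data.head()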
Out[104… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City
0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX)
2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA)
3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA)
4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
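As an alternative to `apply`, the same "City (ST)" label could be built with vectorized string methods. This is a sketch that assumes every address has the form "street, city, state zip"; the two addresses are taken from the table above.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Purchase Address": [
            "917 1st St, Dallas, TX 75001",
            "682 Chestnut St, Boston, MA 02215",
        ]
    }
)

# Split "street, city, state zip" into three columns.
parts = sample["Purchase Address"].str.split(",", expand=True)
city = parts[1].str.strip()
# The state is the first token of the last part, e.g. "TX" in "TX 75001".
state = parts[2].str.strip().str.split(" ").str[0]

sample["City"] = city + " (" + state + ")"
print(sample["City"].tolist())
```

Keeping the state next to the city name matters because different states can share a city name.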
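# Total sales per city (assumed definitions of results and cities)
results = all_data.groupby('City')[['Sales']].sum()
cities = results.index.tolist()
plt.bar(cities, results['Sales'])
plt.xticks(cities, rotation = 'vertical', size = 12)
plt.ylabel('Sales in USD')
plt.xlabel('Cities')
plt.show()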
Answer 2:
The city that sold the most product was San Francisco (CA).
Question 3. What time should we display advertisements to maximize the likelihood of a customer’s buying a product?
Out[108… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 9 27
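The Hour and Minute columns shown above could be produced along these lines. This is a sketch that assumes Order Date starts out as text such as "04/19/19 08:46".

```python
import pandas as pd

# Two Order Date values taken from the earlier tables.
sample = pd.DataFrame({"Order Date": ["04/19/19 08:46", "04/07/19 22:30"]})

# Convert the text column to datetime so the .dt accessor becomes available.
sample["Order Date"] = pd.to_datetime(sample["Order Date"], format="%m/%d/%y %H:%M")
sample["Hour"] = sample["Order Date"].dt.hour
sample["Minute"] = sample["Order Date"].dt.minute

print(sample)
```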
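# Number of order lines placed in each hour of the day
orders_per_hour = all_data.groupby('Hour')['Order ID'].count()
hours = orders_per_hour.index
plt.plot(hours, orders_per_hour)
plt.xticks(hours, size = 12)
plt.xlabel('Hours')
plt.ylabel('Number of Orders')
plt.grid()
plt.show()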
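The hourly counts can also be inspected numerically instead of being read off the chart. The sketch below uses a handful of hypothetical orders to show the idea.

```python
import pandas as pd

# Hypothetical hours for six order lines.
sample = pd.DataFrame(
    {"Order ID": [1, 2, 3, 4, 5, 6], "Hour": [12, 19, 12, 9, 19, 12]}
)

# Count order lines per hour, then keep the two busiest hours.
orders_per_hour = sample.groupby("Hour")["Order ID"].count()
busiest = orders_per_hour.nlargest(2)

print(busiest)
```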
Answer 3:
The best times to display advertisements are around 12:00 and 19:00, when customers are most likely to buy a product.
Question 4. What products are most often sold together?
Out[110… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 9 27
As you can see, some Order IDs are repeated. For example, Google Phone (row 3) and Wired Headphones (row 4) share the same Order ID, so they were bought together in a single order.
In [111… # https://stackoverflow.com/questions/43348194/pandas-select-rows-if-id-appear-several-time
df = all_data[all_data['Order ID'].duplicated(keep=False)]
df.head(20)
Out[111… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38
18 176574 Google Phone 1 600.00 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 19 42
19 176574 USB-C Charging Cable 1 11.95 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) 19 42
30 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) 11 31
31 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) 11 31
32 176586 AAA Batteries (4-pack) 2 2.99 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 5.98 San Francisco (CA) 17 0
33 176586 Google Phone 1 600.00 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 600.00 San Francisco (CA) 17 0
119 176672 Lightning Charging Cable 1 14.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 14.95 New York City (NY) 11 7
120 176672 USB-C Charging Cable 1 11.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 11.95 New York City (NY) 11 7
129 176681 Apple Airpods Headphones 1 150.00 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 150.00 Seattle (WA) 10 39
130 176681 ThinkPad Laptop 1 999.99 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 999.99 Seattle (WA) 10 39
138 176689 Bose SoundSport Headphones 1 99.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 99.99 New York City (NY) 17 15
139 176689 AAA Batteries (4-pack) 2 2.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 5.98 New York City (NY) 17 15
189 176739 34in Ultrawide Monitor 1 379.99 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 379.99 Austin (TX) 17 38
190 176739 Google Phone 1 600.00 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 600.00 Austin (TX) 17 38
225 176774 Lightning Charging Cable 1 14.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 14.95 Los Angeles (CA) 15 6
226 176774 USB-C Charging Cable 1 11.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) 15 6
233 176781 iPhone 1 700.00 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 700.00 Dallas (TX) 7 37
234 176781 Lightning Charging Cable 1 14.95 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 14.95 Dallas (TX) 7 37
Check the DataFrame: it now contains only the rows whose Order ID appears more than once.
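The effect of `keep=False` can be seen on a small sample. The sketch below uses four rows from the tables above; calling `.copy()` on the filtered frame is a suggestion for avoiding a SettingWithCopyWarning when new columns are added to it later.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Order ID": [176559, 176560, 176560, 176561],
        "Product": [
            "Bose SoundSport Headphones",
            "Google Phone",
            "Wired Headphones",
            "Wired Headphones",
        ],
    }
)

# Default keep="first": only the second and later occurrences are flagged.
first_kept = sample["Order ID"].duplicated()

# keep=False: every row whose Order ID occurs more than once is flagged.
all_marked = sample["Order ID"].duplicated(keep=False)

# .copy() makes an independent frame that is safe to modify.
multi_item = sample[all_marked].copy()

print(multi_item)
```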
In [112… # https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby
<ipython-input-112-5d4ac7236136>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
count = Counter()
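A minimal sketch of how pairs of products bought together could be counted with `Counter`. It assumes the products are grouped per Order ID and paired with `itertools.combinations`, which is one possible approach rather than necessarily the one used in this notebook. The six rows come from the duplicated-order table above.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

sample = pd.DataFrame(
    {
        "Order ID": [176560, 176560, 176574, 176574, 176586, 176586],
        "Product": [
            "Google Phone",
            "Wired Headphones",
            "Google Phone",
            "USB-C Charging Cable",
            "AAA Batteries (4-pack)",
            "Google Phone",
        ],
    }
)

# One list of products per order.
products_per_order = sample.groupby("Order ID")["Product"].apply(list)

count = Counter()
for products in products_per_order:
    # Sort so that (A, B) and (B, A) are counted as the same pair.
    count.update(combinations(sorted(products), 2))

for pair, n in count.most_common(3):
    print(pair, n)
```

On the full data set, `count.most_common(10)` would list the ten pairs that appear most often in the same order.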
Question 5. What product sold the most? Why do you think it sold the most?
In [114… all_data.head()
Out[114… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute
0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) 8 46
2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) 22 30
3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38
4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38
5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 9 27
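# Total quantity ordered per product (assumed definitions of the two variables)
quantity_ordered = all_data.groupby('Product')['Quantity Ordered'].sum()
products = quantity_ordered.index.tolist()
plt.bar(products, quantity_ordered)
plt.xticks(products, rotation = 'vertical', size = 12)
plt.ylabel('Quantity Ordered')
plt.xlabel('Products')
plt.show()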
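The product with the largest total quantity can also be found directly from the grouped data. The quantities in this sketch are hypothetical sample rows, not the notebook's real totals.

```python
import pandas as pd

# Hypothetical order lines.
sample = pd.DataFrame(
    {
        "Product": [
            "USB-C Charging Cable",
            "Wired Headphones",
            "Wired Headphones",
            "Google Phone",
            "AAA Batteries (4-pack)",
            "AAA Batteries (4-pack)",
        ],
        "Quantity Ordered": [2, 1, 1, 1, 2, 2],
    }
)

# Sum the units per product and pick the product with the largest total.
quantity_ordered = sample.groupby("Product")["Quantity Ordered"].sum()
top_product = quantity_ordered.idxmax()

print(top_product, quantity_ordered.loc[top_product])
```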
# Average price per product (assumed definition; its printed values appear below)
prices = all_data.groupby('Product')['Price Each'].mean()
print(prices)
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.bar(products, quantity_ordered, color='g')
ax2.plot(products, prices, color='b')
ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color='g')
ax2.set_ylabel('Price ($)', color='b')
# tick_params rotates the labels without triggering the FixedFormatter warning
ax1.tick_params(axis='x', labelrotation=90, labelsize=12)
# plt.show() works with the inline (non-GUI) backend, unlike fig.show()
plt.show()
Product
20in Monitor                   109.99
27in 4K Gaming Monitor         389.99
27in FHD Monitor               149.99
34in Ultrawide Monitor         379.99
AA Batteries (4-pack)            3.84
AAA Batteries (4-pack)           2.99
Apple Airpods Headphones       150.00
Bose SoundSport Headphones      99.99
Flatscreen TV                  300.00
Google Phone                   600.00
LG Dryer                       600.00
LG Washing Machine             600.00
Lightning Charging Cable        14.95
Macbook Pro Laptop            1700.00
ThinkPad Laptop                999.99
USB-C Charging Cable            11.95
Vareebadd Phone                400.00
Wired Headphones                11.99
iPhone                         700.00
Name: Price Each, dtype: float64
The blue line shows the price and the green bars show the quantity ordered for each product. The chart suggests an inverse relationship: products ordered in smaller quantities tend to have higher prices, and vice versa. For example, the MacBook Pro may be in high demand, which could explain why its price is much higher than that of less demanded products.
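One way to check the inverse relationship numerically is to compute the correlation between price and quantity. In the sketch below the prices come from the table above, while the quantities are made-up placeholders, so the resulting number is only illustrative.

```python
import pandas as pd

# Prices are taken from the price table; the quantities are made-up placeholders.
summary = pd.DataFrame(
    {
        "Price Each": [2.99, 11.95, 150.00, 1700.00],
        "Quantity Ordered": [300, 250, 150, 50],
    },
    index=[
        "AAA Batteries (4-pack)",
        "USB-C Charging Cable",
        "Apple Airpods Headphones",
        "Macbook Pro Laptop",
    ],
)

# Pearson correlation; a value below zero means quantity tends to fall as price rises.
correlation = summary["Price Each"].corr(summary["Quantity Ordered"])

print(round(correlation, 2))
```

With the real data, the same call on `prices` and `quantity_ordered` would show how strong the relationship actually is.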