Case Study: Analyze Sales: Clean Up Data

This case study analyzes sales data to answer 5 business questions. The document discusses cleaning the data by dropping rows with missing values, converting columns to the correct data types, and extracting city and state from addresses. Sales data is then summed by month to identify the best month for sales. The document generates a bar plot to visualize monthly sales.
Case study: Analyze Sales

By: Juan David Serna Valderrama

Some skills that we are going to use in this case study:


Drop NaN values from DataFrame
Removing rows based on a condition
Change the type of columns (to_numeric, to_datetime, astype)

We will explore 5 high-level business questions related to our data:

1. What was the best month for sales? How much was earned that month?
2. What city sold the most product?
3. What time should we display advertisements to maximize the likelihood of a customer’s buying a product?
4. What products are most often sold together?
5. What product sold the most? Why do you think it sold the most?

Import libraries
In [93]: import pandas as pd
import os
import matplotlib.pyplot as plt

We have 12 CSV files. We join them into a single file.


In [94]: df = pd.read_csv("C:/Users/juand/OneDrive/Escritorio/Sales_Data/Sales_April_2019.csv")

files = [file for file in os.listdir("C:/Users/juand/OneDrive/Escritorio/Sales_Data")]

all_months_data = pd.DataFrame()
for file in files:
    df = pd.read_csv("C:/Users/juand/OneDrive/Escritorio/Sales_Data/" + file)
    all_months_data = pd.concat([all_months_data, df])

all_months_data.to_csv("all_data.csv", index= False)
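The merge step can also be sketched without repeating the folder path for every read. This is a minimal sketch, assuming all monthly files sit in one folder and share the same columns. The helper name `merge_csv_folder`, the temporary folder, and the tiny demo files are hypothetical and not part of the original notebook. Collecting the frames in a list and calling `pd.concat` once avoids copying the growing DataFrame on every iteration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_csv_folder(folder):
    """Read every .csv file in `folder` and stack them into one DataFrame."""
    paths = sorted(Path(folder).glob("*.csv"))  # skips files that are not CSV
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demo with two tiny, made-up monthly files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Order ID": [1, 2], "Product": ["A", "B"]}).to_csv(
        Path(tmp) / "Sales_January_2019.csv", index=False
    )
    pd.DataFrame({"Order ID": [3], "Product": ["C"]}).to_csv(
        Path(tmp) / "Sales_February_2019.csv", index=False
    )
    merged = merge_csv_folder(tmp)

print(merged.shape)
```

Filtering on the `.csv` suffix also protects the loop from stray files (for example a previously saved merged file) that happen to live in the same folder.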

Read the new merged file


In [95]: all_data = pd.read_csv("C:/Users/juand/OneDrive/Escritorio/all_data.csv")
all_data.head(10)

Out[95]: Order ID Product Quantity Ordered Price Each Order Date Purchase Address

0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001

1 NaN NaN NaN NaN NaN NaN

2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215

3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001

6 176562 USB-C Charging Cable 1 11.95 04/29/19 13:03 381 Wilson St, San Francisco, CA 94016

7 176563 Bose SoundSport Headphones 1 99.99 04/02/19 07:46 668 Center St, Seattle, WA 98101

8 176564 USB-C Charging Cable 1 11.95 04/12/19 10:58 790 Ridge St, Atlanta, GA 30301

9 176565 Macbook Pro Laptop 1 1700 04/24/19 10:38 915 Willow St, San Francisco, CA 94016

Clean up data
Drop rows of NaN
In [96]: nan_df = all_data[all_data.isna().any(axis=1)]
nan_df.head()

Out[96]: Order ID Product Quantity Ordered Price Each Order Date Purchase Address

1 NaN NaN NaN NaN NaN NaN

356 NaN NaN NaN NaN NaN NaN

735 NaN NaN NaN NaN NaN NaN

1433 NaN NaN NaN NaN NaN NaN

1553 NaN NaN NaN NaN NaN NaN

In [97]: all_data = all_data.dropna(how='all')


all_data.head()

Out[97]: Order ID Product Quantity Ordered Price Each Order Date Purchase Address

0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001

2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215

3 176560 Google Phone 1 600 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001

5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001
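The difference between `how='all'` and `how='any'` matters here, because only completely empty rows should be removed. The following is a small illustration on made-up data (the `demo` frame is hypothetical, not taken from the sales files):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Order ID": ["176558", np.nan, "176559"],
    "Product": ["USB-C Charging Cable", np.nan, np.nan],
})

# how="all" drops a row only when every column is NaN (row 1 here).
only_empty_rows_removed = demo.dropna(how="all")

# how="any" drops a row as soon as one column is NaN (rows 1 and 2 here).
any_missing_removed = demo.dropna(how="any")

print(len(only_empty_rows_removed), len(any_missing_removed))
```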

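Find the rows whose Order Date starts with 'Or' and delete them (these are most likely header rows repeated inside the data)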


In [98]: #temp_df = all_data[all_data[condition]]
all_data = all_data[all_data['Order Date'].str[0:2] != 'Or']
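The same filter can be written with `str.startswith`, which states the intent a little more directly. This is a sketch on a made-up frame, assuming the unwanted rows are header lines repeated inside the data; `na=False` treats missing values as "not a header" so the boolean mask never contains NaN.

```python
import pandas as pd

demo = pd.DataFrame({
    "Order ID": ["176558", "Order ID", "176559"],
    "Order Date": ["04/19/19 08:46", "Order Date", "04/07/19 22:30"],
})

# True for rows whose Order Date starts with "Or" (the repeated header).
is_header = demo["Order Date"].str.startswith("Or", na=False)

# Keep everything that is not a header row.
cleaned = demo[~is_header]

print(cleaned)
```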

Convert the columns to the correct type


In [99]: # all_data['Quantity Ordered'] = Make 'int'
all_data['Quantity Ordered'] = pd.to_numeric(all_data['Quantity Ordered'])

# all_data['Price Each'] = Make 'float'


all_data['Price Each'] = pd.to_numeric(all_data['Price Each'])
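`pd.to_numeric` raises an error when a value cannot be parsed, which is why the header rows had to be removed first. As a hedged illustration on a made-up Series, `errors="coerce"` is an alternative that turns unparseable values into NaN instead of failing:

```python
import pandas as pd

raw = pd.Series(["2", "1", "Quantity Ordered"])

# Strict conversion: the text value makes pd.to_numeric raise ValueError.
strict_failed = False
try:
    pd.to_numeric(raw)
except ValueError:
    strict_failed = True

# Lenient conversion: the text value becomes NaN and can be dropped later.
lenient = pd.to_numeric(raw, errors="coerce")

print(strict_failed)
print(lenient)
```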

Question 1. What was the best month for sales? How much was earned that month?
Add a new column with the month number

In [100… all_data['Month'] = all_data['Order Date'].str[0:2]


all_data['Month'] = all_data['Month'].astype('int32')
all_data.head()

Out[100… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month

0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4

2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4

3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4

4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4

5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4
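Slicing the first two characters works because every date starts with a two-digit month. An alternative sketch, assuming the dates always follow the `MM/DD/YY HH:MM` layout seen in the table, parses the string and reads the month from the resulting timestamp (the `order_dates` Series below is a made-up sample):

```python
import pandas as pd

order_dates = pd.Series(["04/19/19 08:46", "12/30/19 00:01"])

# Parse with an explicit format so the month and day are never swapped.
parsed = pd.to_datetime(order_dates, format="%m/%d/%y %H:%M")

# .dt.month returns the month as an integer from 1 to 12.
months = parsed.dt.month

print(months.tolist())
```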

Add a Sales column

In [101… all_data['Sales'] = all_data['Quantity Ordered'] * all_data['Price Each']


all_data.head()

Out[101… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales

0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90

2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99

3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00

4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99

5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99

Sum per Month

In [102… results = all_data.groupby('Month').sum(numeric_only=True)

Plot the results with matplotlib.pyplot

In [103… months = range(1, 13)


plt.bar(months, results['Sales'])
plt.xticks(months)
plt.ylabel('Sales in USD')
plt.xlabel('Month')
plt.show()

Answer 1:
The best month for sales was December.
It earned at least 4 million dollars.
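The best month can also be read from the grouped totals directly instead of from the bar plot. A minimal sketch on made-up numbers (the `demo` frame is hypothetical; on the real data the same two calls would be applied to the `Sales` column of `all_data`):

```python
import pandas as pd

demo = pd.DataFrame({
    "Month": [1, 1, 2, 12, 12],
    "Sales": [10.0, 5.0, 20.0, 30.0, 1.5],
})

# Total sales per month.
monthly_sales = demo.groupby("Month")["Sales"].sum()

# idxmax returns the index label (the month) of the largest total.
best_month = monthly_sales.idxmax()
best_total = monthly_sales.max()

print(best_month, best_total)
```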

Question 2. What city sold the most product?

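Add a City column

Use .apply() to extract the city and the state from the address

In [104… def get_city(address):
    # The city is the second comma-separated field; strip the leading space.
    return address.split(',')[1].strip()

def get_state(address):
    # The third field looks like " TX 75001"; the state is its first token.
    return address.split(',')[2].split(" ")[1]

all_data['City'] = all_data['Purchase Address'].apply(lambda x: f'{get_city(x)} ({get_state(x)})')
all_data.head()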

Out[104… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City

0 176558 USB-C Charging Cable 2 11.95 04/19/19 08:46 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX)

2 176559 Bose SoundSport Headphones 1 99.99 04/07/19 22:30 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA)

3 176560 Google Phone 1 600.00 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA)

4 176560 Wired Headphones 1 11.99 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)

5 176561 Wired Headphones 1 11.99 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA)
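The same City label can be built with vectorized string methods instead of `.apply()`. This is a sketch under the assumption that every address has the form `street, city, state zip`, as in the rows shown above; the variable names are illustrative only.

```python
import pandas as pd

addresses = pd.Series([
    "917 1st St, Dallas, TX 75001",
    "682 Chestnut St, Boston, MA 02215",
])

# Split each address into three columns: street, city, "state zip".
parts = addresses.str.split(",", expand=True)

city = parts[1].str.strip()
state = parts[2].str.strip().str.split(" ").str[0]

# Combine city and state so that cities with the same name stay distinct.
city_label = city + " (" + state + ")"

print(city_label.tolist())
```

Keeping the state in the label is what separates Portland (ME) from Portland (OR) in the grouped results.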

In [105… results = all_data.groupby('City').sum(numeric_only=True)


results

Out[105… Quantity Ordered Price Each Month Sales

City

Atlanta (GA) 16602 2.779908e+06 104794 2.795499e+06

Austin (TX) 11153 1.809874e+06 69829 1.819582e+06

Boston (MA) 22528 3.637410e+06 141112 3.661642e+06

Dallas (TX) 16730 2.752628e+06 104620 2.767975e+06

Los Angeles (CA) 33289 5.421435e+06 208325 5.452571e+06

New York City (NY) 27932 4.635371e+06 175741 4.664317e+06

Portland (ME) 2750 4.471893e+05 17144 4.497583e+05

Portland (OR) 11303 1.860558e+06 70621 1.870732e+06

San Francisco (CA) 50239 8.211462e+06 315520 8.262204e+06

Seattle (WA) 16553 2.733296e+06 104941 2.747755e+06

In [106… cities = [city for city, df in all_data.groupby('City')]

plt.bar(cities, results['Sales'])
plt.xticks(cities, rotation = 'vertical', size = 12)
plt.ylabel('Sales in USD')
plt.xlabel('Cities')
plt.show()

Answer 2:
The city that sold the most product was San Francisco (CA).

Question 3. What time should we display advertisements to maximize the likelihood of a customer’s buying a product?

Convert Order Date to datetime

In [107… all_data['Order Date'] = pd.to_datetime(all_data['Order Date'])

In [108… all_data['Hour'] = all_data['Order Date'].dt.hour


all_data['Minute'] = all_data['Order Date'].dt.minute
all_data.head()

Out[108… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute

0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) 8 46

2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) 22 30

3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38

4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38

5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 9 27

In [109… hours = [hour for hour, df in all_data.groupby('Hour')]

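# Count the order rows per hour; one column is enough, since every column has the same count.
plt.plot(hours, all_data.groupby('Hour')['Order ID'].count())
plt.xticks(hours, size = 12)
plt.xlabel('Hours')
plt.ylabel('Number of Orders')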
plt.grid()
plt.show()

Answer 3:
Based on the plot, the best hours to display advertisements to maximize the likelihood of a customer buying a product are around 12:00 and 19:00.
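The busiest hours can also be listed numerically rather than read from the line plot. A minimal sketch on made-up timestamps (not the real order data), using `value_counts` and `nlargest`:

```python
import pandas as pd

timestamps = pd.to_datetime(pd.Series([
    "2019-04-19 08:46", "2019-04-07 12:30", "2019-04-12 12:38",
    "2019-04-30 19:27", "2019-04-29 19:03", "2019-04-02 19:46",
]))

# Number of orders placed in each hour of the day.
orders_per_hour = timestamps.dt.hour.value_counts()

# The two hours with the most orders, largest first.
busiest = orders_per_hour.nlargest(2)

print(busiest)
```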

Question 4. What products are most often sold together?


In [110… all_data.head()

Out[110… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute

0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) 8 46

2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) 22 30

3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38

4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38

5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 9 27

As you can see, the Order ID is repeated in some cases. For example, Google Phone (row number 3) and Wired Headphones (row number 4) have the same Order ID, which means that both products were bought in the same order.

In [111… # https://stackoverflow.com/questions/43348194/pandas-select-rows-if-id-appear-several-time

df = all_data[all_data['Order ID'].duplicated(keep=False)]
df.head(20)

Out[111… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute

3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38

4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38

18 176574 Google Phone 1 600.00 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 19 42

19 176574 USB-C Charging Cable 1 11.95 2019-04-03 19:42:00 20 Hill St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) 19 42

30 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) 11 31

31 176585 Bose SoundSport Headphones 1 99.99 2019-04-07 11:31:00 823 Highland St, Boston, MA 02215 4 99.99 Boston (MA) 11 31

32 176586 AAA Batteries (4-pack) 2 2.99 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 5.98 San Francisco (CA) 17 0

33 176586 Google Phone 1 600.00 2019-04-10 17:00:00 365 Center St, San Francisco, CA 94016 4 600.00 San Francisco (CA) 17 0

119 176672 Lightning Charging Cable 1 14.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 14.95 New York City (NY) 11 7

120 176672 USB-C Charging Cable 1 11.95 2019-04-12 11:07:00 778 Maple St, New York City, NY 10001 4 11.95 New York City (NY) 11 7

129 176681 Apple Airpods Headphones 1 150.00 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 150.00 Seattle (WA) 10 39

130 176681 ThinkPad Laptop 1 999.99 2019-04-20 10:39:00 331 Cherry St, Seattle, WA 98101 4 999.99 Seattle (WA) 10 39

138 176689 Bose SoundSport Headphones 1 99.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 99.99 New York City (NY) 17 15

139 176689 AAA Batteries (4-pack) 2 2.99 2019-04-24 17:15:00 659 Lincoln St, New York City, NY 10001 4 5.98 New York City (NY) 17 15

189 176739 34in Ultrawide Monitor 1 379.99 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 379.99 Austin (TX) 17 38

190 176739 Google Phone 1 600.00 2019-04-05 17:38:00 730 6th St, Austin, TX 73301 4 600.00 Austin (TX) 17 38

225 176774 Lightning Charging Cable 1 14.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 14.95 Los Angeles (CA) 15 6

226 176774 USB-C Charging Cable 1 11.95 2019-04-25 15:06:00 372 Church St, Los Angeles, CA 90001 4 11.95 Los Angeles (CA) 15 6

233 176781 iPhone 1 700.00 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 700.00 Dallas (TX) 7 37

234 176781 Lightning Charging Cable 1 14.95 2019-04-03 07:37:00 976 Hickory St, Dallas, TX 75001 4 14.95 Dallas (TX) 7 37

Check the DataFrame: it now contains only the rows whose Order ID is repeated.

In [112… # https://stackoverflow.com/questions/27298178/concatenate-strings-from-several-rows-using-pandas-groupby

# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
df = df.copy()
df['Grouped'] = df.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))

df2 = df[['Order ID', 'Grouped']].drop_duplicates()
df2.head()

Out[112… Order ID Grouped

3 176560 Google Phone,Wired Headphones

18 176574 Google Phone,USB-C Charging Cable

30 176585 Bose SoundSport Headphones,Bose SoundSport Hea...

32 176586 AAA Batteries (4-pack),Google Phone

119 176672 Lightning Charging Cable,USB-C Charging Cable

In [113… # Referenced: https://stackoverflow.com/questions/52195887/counting-unique-pairs-of-numbers-into-a-python-dictionary

from itertools import combinations
from collections import Counter

count = Counter()

# Count every pair of products that appears in the same order.
for row in df2['Grouped']:
    row_list = row.split(',')
    count.update(Counter(combinations(row_list, 2)))

for key, value in count.most_common(10):
    print(key, value)

('iPhone', 'Lightning Charging Cable') 1005


('Google Phone', 'USB-C Charging Cable') 987
('iPhone', 'Wired Headphones') 447
('Google Phone', 'Wired Headphones') 414
('Vareebadd Phone', 'USB-C Charging Cable') 361
('iPhone', 'Apple Airpods Headphones') 360
('Google Phone', 'Bose SoundSport Headphones') 220
('USB-C Charging Cable', 'Wired Headphones') 160
('Vareebadd Phone', 'Wired Headphones') 143
('Lightning Charging Cable', 'Wired Headphones') 92
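`combinations` keeps the items in the order they appear, so the loop above counts `('A', 'B')` and `('B', 'A')` as two different pairs. The following is a possible variation, not the notebook's method: sorting the products of each order first makes the pair independent of row order. The `orders` dictionary is made-up sample data.

```python
from collections import Counter
from itertools import combinations

orders = {
    "A": ["iPhone", "Lightning Charging Cable"],
    "B": ["Lightning Charging Cable", "iPhone"],
    "C": ["Google Phone", "USB-C Charging Cable", "Wired Headphones"],
}

pair_counts = Counter()
for products in orders.values():
    # Remove duplicates and sort so that each pair has one canonical order.
    unique_sorted = sorted(set(products))
    pair_counts.update(combinations(unique_sorted, 2))

for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

Using `set` also drops orders such as 176585, where the same product appears twice and would otherwise be counted as a pair with itself.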

Question 5. What product sold the most? Why do you think it sold the most?
In [114… all_data.head()

Out[114… Order ID Product Quantity Ordered Price Each Order Date Purchase Address Month Sales City Hour Minute

0 176558 USB-C Charging Cable 2 11.95 2019-04-19 08:46:00 917 1st St, Dallas, TX 75001 4 23.90 Dallas (TX) 8 46

2 176559 Bose SoundSport Headphones 1 99.99 2019-04-07 22:30:00 682 Chestnut St, Boston, MA 02215 4 99.99 Boston (MA) 22 30

3 176560 Google Phone 1 600.00 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 600.00 Los Angeles (CA) 14 38

4 176560 Wired Headphones 1 11.99 2019-04-12 14:38:00 669 Spruce St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 14 38

5 176561 Wired Headphones 1 11.99 2019-04-30 09:27:00 333 8th St, Los Angeles, CA 90001 4 11.99 Los Angeles (CA) 9 27

In [115… product_group = all_data.groupby('Product')


product_group.sum(numeric_only=True)

Out[115… Quantity Ordered Price Each Month Sales Hour Minute

Product

20in Monitor 4129 451068.99 29336 454148.71 58764 122252

27in 4K Gaming Monitor 6244 2429637.70 44440 2435097.56 90916 184331

27in FHD Monitor 7550 1125974.93 52558 1132424.50 107540 219948

34in Ultrawide Monitor 6199 2348718.19 43304 2355558.01 89076 183480

AA Batteries (4-pack) 27635 79015.68 145558 106118.40 298342 609039

AAA Batteries (4-pack) 31017 61716.59 146370 92740.83 297332 612113

Apple Airpods Headphones 15661 2332350.00 109477 2349150.00 223304 455570

Bose SoundSport Headphones 13457 1332366.75 94113 1345565.43 192445 392603

Flatscreen TV 4819 1440000.00 34224 1445700.00 68815 142789

Google Phone 5532 3315000.00 38305 3319200.00 79479 162773

LG Dryer 646 387600.00 4383 387600.00 9326 19043

LG Washing Machine 666 399600.00 4523 399600.00 9785 19462

Lightning Charging Cable 23217 323787.10 153092 347094.15 312529 634442

Macbook Pro Laptop 4728 8030800.00 33548 8037600.00 68261 137574

ThinkPad Laptop 4130 4127958.72 28950 4129958.70 59746 121508

USB-C Charging Cable 23975 261740.85 154819 286501.25 314645 647586

Vareebadd Phone 2068 826000.00 14309 827200.00 29472 61835

Wired Headphones 20557 226395.18 133397 246478.43 271720 554023

iPhone 6849 4789400.00 47941 4794300.00 98657 201688

In [117… product_group = all_data.groupby('Product')


quantity_ordered = product_group['Quantity Ordered'].sum()

products = [product for product, df in product_group]

plt.bar(products, quantity_ordered)
plt.xticks(products, rotation = 'vertical', size = 12)
plt.ylabel('Quantity Ordered')
plt.xlabel('Products')
plt.show()

Next, we take a closer look at the quantity ordered by comparing it with the price of each product.

In [124… prices = all_data.groupby('Product')['Price Each'].mean()

print(prices)

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.bar(products, quantity_ordered, color='g')
ax2.plot(products, prices, color='b')

ax1.set_xlabel('Product Name')
ax1.set_ylabel('Quantity Ordered', color='g')
ax2.set_ylabel('Price ($)', color='b')
# Rotate the existing tick labels; calling set_xticklabels here triggers a FixedFormatter warning.
ax1.tick_params(axis='x', labelrotation=90, labelsize=12)

plt.show()

Product
20in Monitor 109.99
27in 4K Gaming Monitor 389.99
27in FHD Monitor 149.99
34in Ultrawide Monitor 379.99
AA Batteries (4-pack) 3.84
AAA Batteries (4-pack) 2.99
Apple Airpods Headphones 150.00
Bose SoundSport Headphones 99.99
Flatscreen TV 300.00
Google Phone 600.00
LG Dryer 600.00
LG Washing Machine 600.00
Lightning Charging Cable 14.95
Macbook Pro Laptop 1700.00
ThinkPad Laptop 999.99
USB-C Charging Cable 11.95
Vareebadd Phone 400.00
Wired Headphones 11.99
iPhone 700.00
Name: Price Each, dtype: float64

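Answer 5:
The product that sold the most was AAA Batteries (4-pack), with 31,017 units ordered.

The blue line shows the price and the green bars show the quantity ordered for each product. The plot suggests an inverse relationship: cheap products such as batteries and charging cables are ordered in the largest quantities, while expensive products are ordered far less often. The relationship is not perfect. For example, the Macbook Pro Laptop is the most expensive product, yet it is ordered more often than several cheaper products such as the LG Dryer, which suggests that it is in high demand despite its price.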
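The inverse relationship between price and quantity can be checked with a correlation coefficient instead of by eye. This is a rough sketch on five products only, with the prices and quantities copied from the tables above; a fuller check would use all 19 products, and a correlation alone does not show that price causes the difference in quantity.

```python
import pandas as pd

sample = pd.DataFrame({
    "Product": [
        "AAA Batteries (4-pack)", "USB-C Charging Cable", "iPhone",
        "Macbook Pro Laptop", "LG Dryer",
    ],
    "Price Each": [2.99, 11.95, 700.00, 1700.00, 600.00],
    "Quantity Ordered": [31017, 23975, 6849, 4728, 646],
})

# Pearson correlation; a negative value means that a higher price
# tends to go together with a lower quantity ordered.
correlation = sample["Price Each"].corr(sample["Quantity Ordered"])

print(round(correlation, 2))
```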
