Sales Data Analysis
0.1 Pandas
Pandas is a powerful Python library for data manipulation and analysis, and it plays a crucial role
in sales analysis.
1. Data Cleaning and Preprocessing: Pandas provides functions for handling missing data,
removing duplicates, and transforming data. We can use pandas to clean and preprocess sales
data, ensuring that it is accurate and consistent before analysis.
2. Data Manipulation: Pandas offers powerful tools for data manipulation, such as filtering,
sorting, grouping, and aggregating data. We can use pandas to perform calculations, calculate
summary statistics, and reshape data to extract meaningful insights from sales data.
3. Time Series Analysis: Pandas has built-in support for time series data, making it easy to
analyze sales data over time. We can use pandas to resample time series data, calculate rolling
statistics, and perform date/time-based operations to understand sales trends and patterns.
4. Data Visualization Integration: Pandas integrates seamlessly with data visualization
libraries like Matplotlib and Seaborn, allowing us to create insightful visualizations of sales
data.
5. Data Merging and Joining: Pandas provides functions for merging and joining multiple
datasets based on common keys or indices. This capability allows us to combine sales data with
other relevant datasets, such as customer data or product data, to perform more comprehensive
analysis and gain deeper insights into sales performance.
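The capabilities listed above can be sketched on a small, made-up dataset. The frames, column names, and values below are hypothetical, chosen only for illustration; they are not taken from the case-study data.

```python
import pandas as pd

# Hypothetical toy data; not from the case-study datasets
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "customer_id": [10, 11, 11, 10, 12],
    "order_date": ["2024-01-05", "2024-01-20", "2024-01-20",
                   "2024-02-03", "2024-02-10"],
    "sales": [100.0, 250.0, 250.0, None, 80.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "segment": ["Consumer", "Corporate", "Consumer"],
})

# Item 1 (cleaning): drop exact duplicate rows, fill missing sales with 0
clean = orders.drop_duplicates().fillna({"sales": 0.0})
clean["order_date"] = pd.to_datetime(clean["order_date"])

# Item 5 (merging): attach the customer segment to each order
merged = clean.merge(customers, on="customer_id", how="left")

# Item 2 (manipulation): total sales per segment
sales_by_segment = merged.groupby("segment")["sales"].sum()

# Item 3 (time series): total sales per calendar month
monthly = merged.groupby(merged["order_date"].dt.to_period("M"))["sales"].sum()

print(sales_by_segment)
print(monthly)
```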
pio.templates.default = "plotly_white"
[5 rows x 21 columns]
51288 MX-2014-114783 2014-12-31 2015-01-06 Standard Class Tamara Dahlen
51289 CA-2014-156720 2014-12-31 2015-01-04 Standard Class Jill Matthias
category sub_category \
51285 Office Supplies Binders
51286 Office Supplies Binders
51287 Office Supplies Labels
51288 Office Supplies Labels
51289 Office Supplies Fasteners
[5 rows x 21 columns]
[9]: # A concise summary of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51290 entries, 0 to 51289
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 51290 non-null object
1 order_date 51290 non-null datetime64[ns]
2 ship_date 51290 non-null datetime64[ns]
3 ship_mode 51290 non-null object
4 customer_name 51290 non-null object
5 segment 51290 non-null object
6 state 51290 non-null object
7 country 51290 non-null object
8 market 51290 non-null object
9 region 51290 non-null object
10 product_id 51290 non-null object
11 category 51290 non-null object
12 sub_category 51290 non-null object
13 product_name 51290 non-null object
14 sales 51290 non-null float64
15 quantity 51290 non-null int64
16 discount 51290 non-null float64
17 profit 51290 non-null float64
18 shipping_cost 51290 non-null float64
19 order_priority 51290 non-null object
20 year 51290 non-null int64
dtypes: datetime64[ns](2), float64(4), int64(2), object(13)
memory usage: 8.2+ MB
[10]: order_id 0
order_date 0
ship_date 0
ship_mode 0
customer_name 0
segment 0
state 0
country 0
market 0
region 0
product_id 0
category 0
sub_category 0
product_name 0
sales 0
quantity 0
discount 0
profit 0
shipping_cost 0
order_priority 0
year 0
dtype: int64
[13]: print(df['month_year'].unique())
print(df['month_year'].dtype)
'2013-05' '2013-06' '2013-07' '2013-08' '2013-09' '2013-10' '2013-11'
'2013-12' '2014-01' '2014-02' '2014-03' '2014-04' '2014-05' '2014-06'
'2014-07' '2014-08' '2014-09' '2014-10' '2014-11' '2014-12']
object
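The cell that creates `month_year` is not shown here. One way such a column of 'YYYY-MM' labels could be derived from a datetime column is sketched below; the toy frame and the choice of `dt.strftime` are assumptions, not necessarily what the notebook did.

```python
import pandas as pd

# Hypothetical toy frame with a datetime column named like the one in df.info()
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2013-05-03", "2013-05-17", "2014-12-31"])
})

# Format each date as a 'YYYY-MM' label; the result holds text, not datetimes
toy["month_year"] = toy["order_date"].dt.strftime("%Y-%m")

print(toy["month_year"].unique())
```

Because the labels are zero-padded text, sorting them alphabetically also sorts them chronologically.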
fig = px.pie(sales_by_category,
values='sales',
names='category',
hole=0.5,
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title_text='Sales Analysis by Category', title_font=dict(size=24))
fig.show()
fig.show()
5. PROFIT BY CATEGORY
[23]: profit_by_category = df.groupby('category')['profit'].sum().reset_index()
fig = px.pie(profit_by_category,
values='profit',
names='category',
hole=0.5,
color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title_text='Profit Analysis by Category', title_font=dict(size=24))
fig.show()
y='profit',
title='Profit Analysis by Sub-Category')
fig.show()
color_palette = colors.qualitative.Pastel
fig = go.Figure()
fig.add_trace(go.Bar(x=sales_profit_by_segment['segment'],
y=sales_profit_by_segment['sales'],
name='Sales',
marker_color=color_palette[0]))
fig.add_trace(go.Bar(x=sales_profit_by_segment['segment'],
y=sales_profit_by_segment['profit'],
name='Profit',
marker_color=color_palette[1]))
fig.show()
sales_profit_by_segment['Sales_to_Profit_Ratio'] = sales_profit_by_segment['sales'] / sales_profit_by_segment['profit']
print(sales_profit_by_segment[['segment', 'Sales_to_Profit_Ratio']])
segment Sales_to_Profit_Ratio
0 Consumer 8.686070
1 Corporate 8.637804
2 Home Office 8.338550
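• The store earns its highest profit from sales to the Consumer segment, although that segment's sales-to-profit ratio (8.69) is also the highest, so it yields slightly less profit per unit of sales than the other two segments.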
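The cell that builds `sales_profit_by_segment` is not shown. A minimal sketch of how such a per-segment table and ratio might be computed follows; the toy values are invented, and the extra `profit_margin` column is an illustrative addition rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "segment": ["Consumer", "Consumer", "Corporate", "Home Office"],
    "sales": [200.0, 100.0, 400.0, 90.0],
    "profit": [20.0, 10.0, 50.0, 15.0],
})

# Total sales and profit per segment
by_segment = toy.groupby("segment")[["sales", "profit"]].sum().reset_index()

# A higher sales-to-profit ratio means less profit per unit of sales
by_segment["sales_to_profit_ratio"] = by_segment["sales"] / by_segment["profit"]

# The inverse view: profit earned per unit of sales
by_segment["profit_margin"] = by_segment["profit"] / by_segment["sales"]

print(by_segment)
```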
9. WHICH ARE THE TOP 10 PRODUCTS BY SALES?
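[29]: # Grouping products by sales
prod_sales = pd.DataFrame(df.groupby('product_name')['sales'].sum())
# Sort in descending order so the best sellers come first
prod_sales = prod_sales.sort_values(by='sales', ascending=False)
# Top 10 products by sales
prod_sales[:10]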
[29]: sales
product_name
Apple Smart Phone, Full Size 86935.7786
Cisco Smart Phone, Full Size 76441.5306
Motorola Smart Phone, Full Size 73156.3030
Nokia Smart Phone, Full Size 71904.5555
Canon imageCLASS 2200 Advanced Copier 61599.8240
Hon Executive Leather Armchair, Adjustable 58193.4841
Office Star Executive Leather Armchair, Adjustable 50661.6840
Harbour Creations Executive Leather Armchair, A… 50121.5160
Samsung Smart Phone, Cordless 48653.4600
Nokia Smart Phone, with Caller ID 47877.7857
[30]: quantity
product_name
Staples 876
Cardinal Index Tab, Clear 337
Eldon File Cart, Single Width 321
Rogers File Cart, Single Width 262
Sanford Pencil Sharpener, Water Color 259
Stockwell Paper Clips, Assorted Sizes 253
Avery Index Tab, Clear 252
Ibico Index Tab, Clear 251
Smead File Cart, Single Width 250
Stanley Pencil Sharpener, Water Color 242
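The code for the quantity ranking above is not shown. A compact way to get a top-N ranking from grouped totals is `Series.nlargest`, sketched here on hypothetical toy data (the product names and quantities are invented).

```python
import pandas as pd

# Hypothetical toy data; not from the case-study dataset
toy = pd.DataFrame({
    "product_name": ["Staples", "Binder", "Staples", "Labels", "Binder", "Staples"],
    "quantity": [5, 2, 3, 1, 4, 2],
})

# Sum the quantity per product, then keep the two largest totals
top2 = toy.groupby("product_name")["quantity"].sum().nlargest(2)

print(top2)
```

Compared with sorting the whole table and slicing, `nlargest` states the intent directly and avoids forgetting the sort step.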
# countplot: Show the counts of observations in each categorical bin using bars
sns.countplot(x='ship_mode', data=df)
[31]: profit
category sub_category
Technology Copiers 258567.54818
Phones 216717.00580
Accessories 129626.30620
Machines 58867.87300
Office Supplies Appliances 141680.58940
Storage 108461.48980
Binders 72449.84600
Paper 59207.68270
Art 57953.91090
Envelopes 29601.11630
Supplies 22583.26310
Labels 15010.51200
Fasteners 11525.42410
Furniture Bookcases 161924.41950
Chairs 141973.79750
Furnishings 46967.42550
Tables -64083.38870
0.5 Case Study 2: Analyzing 12 months' worth of sales data to answer business questions
The data contains hundreds of thousands of electronics store purchases broken down by month,
product type, cost, purchase address, etc.
[35]: try:
    # Attempt to read CSV file into a pandas DataFrame
    all_data = pd.read_csv("/content/all_data.csv", encoding='utf-8')
    print("CSV file successfully loaded.")
except Exception as e:
    print("An error occurred while reading the CSV file:", e)
    # Handle the error, or provide appropriate feedback to the user
[36]: all_data.head()
2 04/07/19 22:30 682 Chestnut St, Boston, MA 02215
3 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
4 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001
Order ID Product Quantity Ordered Price Each Order Date Purchase Address
1 NaN NaN NaN NaN NaN NaN
356 NaN NaN NaN NaN NaN NaN
735 NaN NaN NaN NaN NaN NaN
1433 NaN NaN NaN NaN NaN NaN
1553 NaN NaN NaN NaN NaN NaN
<class 'pandas.core.frame.DataFrame'>
Index: 185950 entries, 0 to 186849
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 185950 non-null object
1 Product 185950 non-null object
2 Quantity Ordered 185950 non-null object
3 Price Each 185950 non-null object
4 Order Date 185950 non-null object
5 Purchase Address 185950 non-null object
dtypes: object(6)
memory usage: 9.9+ MB
[41]: all_data.shape
[41]: (185950, 6)
[42]: all_data.describe()
[45]: all_data['Month'] = pd.to_datetime(all_data['Order Date']).dt.month
all_data.head()
<ipython-input-45-9ae23976486e>:1: UserWarning:
Could not infer format, so each element will be parsed individually, falling
back to `dateutil`. To ensure parsing is consistent and as-expected, please
specify a format.
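The warning above appears because no date format was given. Judging from the sample rows (for example `04/07/19 22:30` and `04/30/19 09:27`), the dates look like month/day/two-digit-year followed by a 24-hour time, so passing that format explicitly should avoid the fallback parser. The sketch below uses a toy frame, and the format string is an assumption based on those samples.

```python
import pandas as pd

# Toy frame mimicking the 'Order Date' strings seen in the sample rows
toy = pd.DataFrame({
    "Order Date": ["04/07/19 22:30", "04/12/19 14:38", "12/30/19 00:01"]
})

# Assumed layout: month/day/two-digit year, then hour:minute on a 24-hour clock
parsed = pd.to_datetime(toy["Order Date"], format="%m/%d/%y %H:%M")

# Parse once, then derive every date part from the parsed column
toy["Month"] = parsed.dt.month
toy["Hour"] = parsed.dt.hour

print(toy)
```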
def get_state(address):
    return address.split(",")[2].split(" ")[1]
all_data.head()
4 04/12/19 14:38 669 Spruce St, Los Angeles, CA 90001 4
5 04/30/19 09:27 333 8th St, Los Angeles, CA 90001 4
City
0 Dallas (TX)
2 Boston (MA)
3 Los Angeles (CA)
4 Los Angeles (CA)
5 Los Angeles (CA)
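Only `get_state` survives in the text above; the step that builds the `City` column is not shown. A sketch of how labels such as `Los Angeles (CA)` could be built from the address strings follows. The helpers `get_city` and `make_city_label` are hypothetical names introduced for illustration.

```python
def get_city(address: str) -> str:
    # Hypothetical helper: the city is the second comma-separated field
    return address.split(",")[1].strip()


def get_state(address: str) -> str:
    # Third field looks like " CA 90001"; splitting on spaces gives ["", "CA", "90001"]
    return address.split(",")[2].split(" ")[1]


def make_city_label(address: str) -> str:
    # Hypothetical helper: combine city and state so same-named cities stay distinct
    return f"{get_city(address)} ({get_state(address)})"


print(make_city_label("669 Spruce St, Los Angeles, CA 90001"))
```

With a DataFrame, the same function could be applied to the address column, for example `all_data['Purchase Address'].apply(make_city_label)`.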
Question 1: What was the best month for sales? How much was earned that month?
[49]: # Perform the calculation
all_data['Sales'] = all_data['Quantity Ordered'].astype('int') * all_data['Price Each'].astype('float')
[51]: monthly_sales.head()
• December (month 12) had the highest sales in 2019, at approximately $4,810,000.
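The cell that builds `monthly_sales` is not shown. A minimal sketch of the calculation, on hypothetical toy data with the same column names, might look like this:

```python
import pandas as pd

# Hypothetical toy data; numeric columns arrive as text, as in all_data.info()
toy = pd.DataFrame({
    "Month": [1, 1, 12, 12],
    "Quantity Ordered": ["1", "2", "1", "3"],
    "Price Each": ["10.0", "5.0", "100.0", "20.0"],
})

# Convert the text columns to numbers before multiplying
toy["Sales"] = toy["Quantity Ordered"].astype(int) * toy["Price Each"].astype(float)

# Total sales per month, then the month with the largest total
monthly_sales = toy.groupby("Month")["Sales"].sum()
best_month = monthly_sales.idxmax()

print(best_month, monthly_sales.loc[best_month])
```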
Question 2: What city sold the most product?
[54]: city_sales = all_data.groupby('City').agg({'Quantity Ordered': 'sum', 'Price Each': 'sum', 'Sales': 'sum'})
city_sales.head()
# Plotting
plt.bar(city_sales.index, city_sales['Sales'])
plt.xticks(rotation=90)
plt.ylabel('Sales in USD ($)')
plt.xlabel('City')
plt.title('Total Sales by City')
plt.show()
all_data['Count'] = 1
all_data.head()
<ipython-input-57-3f3d5aef9003>:2: UserWarning:
Could not infer format, so each element will be parsed individually, falling
back to `dateutil`. To ensure parsing is consistent and as-expected, please
specify a format.
<ipython-input-57-3f3d5aef9003>:3: UserWarning:
Could not infer format, so each element will be parsed individually, falling
back to `dateutil`. To ensure parsing is consistent and as-expected, please
specify a format.
plt.plot(keys, all_data.groupby(['Hour']).count()['Count'])
plt.grid()
plt.show()
The data show two approximate peaks, at hour 12 (12 PM) and hour 19 (7 PM). This makes sense,
since most people shop during the day. Based on this, the store could schedule its advertisements
shortly before those peaks, for example at 11:30 AM and/or 6:30 PM.
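The peak hours can also be read off numerically rather than from the plot. The sketch below counts orders per hour on hypothetical toy data and picks the two busiest hours; it assumes an integer `Hour` column like the one used in the plot.

```python
import pandas as pd

# Hypothetical toy data: one row per order, with the hour it was placed
toy = pd.DataFrame({"Hour": [9, 12, 12, 12, 19, 19, 3]})

# Number of orders placed in each hour of the day
orders_per_hour = toy.groupby("Hour").size()

# The two hours with the most orders, busiest first
peak_hours = orders_per_hour.nlargest(2).index.tolist()

print(peak_hours)
```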
Question 4: What products are most often sold together?
[60]: df = all_data[all_data['Order ID'].duplicated(keep=False)]
<ipython-input-61-91e38189159a>:1: SettingWithCopyWarning:
[62]: from itertools import combinations
from collections import Counter
count = Counter()
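The rest of this cell is not shown. One common way to count products bought together, sketched here on hypothetical toy data, is to collect each order's products and count every pair with `combinations` and `Counter`. Taking a `.copy()` of the filtered rows is one way to avoid a `SettingWithCopyWarning` if columns are added later.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Hypothetical toy data; product names are invented for illustration
toy = pd.DataFrame({
    "Order ID": ["A", "A", "B", "B", "C", "D", "D", "D"],
    "Product": ["Phone", "Charging Cable", "Phone", "Charging Cable",
                "Batteries", "Laptop", "Mouse", "Headphones"],
})

# Keep only orders with more than one line item
multi = toy[toy["Order ID"].duplicated(keep=False)].copy()

# One list of products per order
baskets = multi.groupby("Order ID")["Product"].apply(list)

# Count every unordered pair; sorting makes (a, b) and (b, a) the same key
count = Counter()
for products in baskets:
    count.update(combinations(sorted(products), 2))

most_common_pair, n = count.most_common(1)[0]
print(most_common_pair, n)
```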
• AAA batteries sold the most.
[65]: df.head()
[70]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9971 entries, 0 to 9970
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 9971 non-null object
1 SalesRep 9971 non-null object
2 Region 9971 non-null object
3 Product 9971 non-null object
4 Color 9971 non-null object
5 Units 9971 non-null int64
6 Revenue 9971 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 545.4+ KB
[66]: df.describe()
[71]: # Check missing entry
df.isna().sum()
[71]: Date 0
SalesRep 0
Region 0
Product 0
Color 0
Units 0
Revenue 0
dtype: int64
1. Revenue Analysis
[67]: df['Revenue'].value_counts().hist(bins=50);
3. What’s the total revenue generated between 2015 and 2017?
[69]: round(df['Revenue'].sum())
[69]: 909171
4. Revenue by Region
[72]: region_revenue = pd.DataFrame(df.groupby(by=['Region'])['Revenue'].sum())
region_revenue.sort_values(ascending=False, by='Revenue')
[72]: Revenue
Region
West 408037.58
South 263256.50
East 237876.79
• West Region generated the most revenue
5. Revenue by Sales Rep
[74]: sales_rep_revenue = df.groupby(by=['SalesRep'])['Revenue'].sum()
sales_rep_revenue = pd.DataFrame(sales_rep_revenue).sort_values(ascending=True, by='Revenue')
sales_rep_revenue
[74]: Revenue
SalesRep
Nicole 92026.68
Adam 102715.60
Jessica 145496.28
Nabil 158904.48
Julie 204450.05
Mike 205577.78
• Mike slightly beat Julie in revenue generation.
6. Revenue by Products
[76]: product_revenue = df[['Units', 'Revenue','Product']].groupby('Product').sum().sort_values(ascending=False,by='Units')
product_revenue
[77]: product_revenue.groupby(by=['Product'])['Revenue'].sum().sort_values(ascending=True).plot(
    kind='bar', ylabel='Revenue', title='Product Revenue');
[79]: [2015, 2016, 2017]
The trend plot looks symmetrical around October in both 2016 and 2017.
Monthly Sales Trend
[82]: ax = df[['Month', 'Units', 'Revenue']].groupby('Month').sum().plot(
    title='Monthly Sales Trend',
    ylabel='Revenue',
);
ax.vlines(10,1,300000, linestyles='dashed')
ax.annotate('Oct',(10,0));
• The highest monthly total is in October.
9. Monthly Sales
[83]: products = pd.DataFrame(df[['Units','Revenue','Product','Month', 'Region']].groupby('Month')['Product'].value_counts())
products
[83]: count
Month Product
1 Bellen 52
Quad 46
Sunbell 34
Sunshine 33
Aspen 33
… …
12 Sunbell 43
Aspen 41
Sunshine 36
Carlota 35
Doublers 32
products = products.reset_index()
[85]: Month Product No_of_products
0 1 Bellen 52
1 1 Quad 46
2 1 Sunbell 34
3 1 Sunshine 33
4 1 Aspen 33
.. … … …
79 12 Sunbell 43
80 12 Aspen 41
81 12 Sunshine 36
82 12 Carlota 35
83 12 Doublers 32
products
[87]: No_of_products
Product Aspen Bellen Carlota Doublers Quad Sunbell Sunshine
Month
1 33 52 30 29 46 34 33
2 26 45 29 25 45 35 26
3 49 55 46 34 77 35 39
4 31 37 35 27 50 43 25
5 33 52 36 30 51 37 31
6 37 53 23 21 52 30 36
7 38 60 37 25 55 46 42
8 34 54 26 35 51 35 35
9 380 596 399 311 560 397 404
10 476 735 439 333 697 491 460
11 119 154 118 77 142 108 103
12 41 55 35 32 64 43 36
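The reshaping step that turns the long Month/Product counts into the wide table above is not shown. One way to do it, sketched on hypothetical toy data, is `DataFrame.pivot`; the notebook may have used a different call (the extra `No_of_products` header level in its output suggests so).

```python
import pandas as pd

# Hypothetical toy data: one row per sale
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2],
    "Product": ["Bellen", "Quad", "Bellen", "Quad", "Quad"],
})

# Long format: one row per (Month, Product) with its sale count
counts = (
    toy.groupby(["Month", "Product"])
    .size()
    .rename("No_of_products")
    .reset_index()
)

# Wide format: months as rows, products as columns; absent combinations become 0
wide = (
    counts.pivot(index="Month", columns="Product", values="No_of_products")
    .fillna(0)
    .astype(int)
)

print(wide)
```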
10. Region Monthly Revenue
[89]: region_sales = pd.DataFrame(df[['Units','Revenue','Product','Month', 'Region']]).groupby(['Month','Region'])['Revenue'].sum()
region_sales = pd.DataFrame(region_sales)
region_sales
[89]: Revenue
Month Region
1 East 5012.34
South 7551.55
West 8550.33
2 East 6428.75
South 5540.10
West 10864.87
3 East 6082.75
South 8863.80
West 14087.99
4 East 6420.63
South 7647.28
West 8865.57
5 East 8782.68
South 5651.30
West 10962.00
6 East 6442.85
South 3954.90
West 9020.65
7 East 7180.45
South 10155.59
West 10150.25
8 East 6031.55
South 7767.60
West 11567.37
9 East 70532.44
South 83228.39
West 127160.06
10 East 87858.60
South 92034.70
West 151780.43
11 East 19478.10
South 24048.59
West 33196.52
12 East 7625.65
South 6812.70
West 11831.54
region_sales
[90]: Revenue
Region East South West
Month
1 5012.34 7551.55 8550.33
2 6428.75 5540.10 10864.87
3 6082.75 8863.80 14087.99
4 6420.63 7647.28 8865.57
5 8782.68 5651.30 10962.00
6 6442.85 3954.90 9020.65
7 7180.45 10155.59 10150.25
8 6031.55 7767.60 11567.37
9 70532.44 83228.39 127160.06
10 87858.60 92034.70 151780.43
11 19478.10 24048.59 33196.52
12 7625.65 6812.70 11831.54
[91]: region_sales.plot(kind='bar', ylabel='Revenue', title='Region Monthly Revenue');
[92]: Revenue
Date
2015 24883.84
2016 444701.72
2017 439585.31
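The cell producing the yearly totals above is not shown. A sketch of one way to aggregate revenue by calendar year follows; the toy rows and the ISO date strings are assumptions, since the actual format of the `Date` column is not visible here.

```python
import pandas as pd

# Hypothetical toy data; the date format is assumed for illustration
toy = pd.DataFrame({
    "Date": ["2015-12-30", "2016-01-15", "2016-10-02", "2017-10-05"],
    "Revenue": [100.0, 250.0, 400.0, 300.0],
})

# Convert the text dates to datetimes so the year can be extracted
toy["Date"] = pd.to_datetime(toy["Date"])

# Total revenue per calendar year
yearly = toy.groupby(toy["Date"].dt.year)["Revenue"].sum()

print(yearly)
```

When comparing years this way, it helps to check how many months each year covers; a year with only a few weeks of data will look small regardless of performance.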
[93]: changes.sort_values('Date').plot(kind='bar');
12. Top 3 products
[94]: product_revenue
salesReps.sort_values(by=['Year','Revenue'], ascending=False)
[95]: Revenue
Year SalesRep
2017 Julie 99727.32
Mike 96062.19
Nabil 81079.23
Jessica 69479.74
Adam 49712.19
Nicole 43524.64
2016 Mike 104590.64
Julie 98895.58
Nabil 74576.22
Jessica 71469.42
Adam 49184.21
Nicole 45985.65
2015 Julie 5827.15
Mike 4924.95
Jessica 4547.12
Adam 3819.20
Nabil 3249.03
Nicole 2516.39
Recommendations:
• The best months for sales are September, October, and November.
• The company should consider running advertising jingles during these months to further increase profit.
• Target advertising at the East and South regions, which trail the West region in revenue.
• Bellen and Quad sell the most during these months; consider stocking more of them.