Coffee Sales - (Data Analyst)
Dataset : Dataset is available in the given link. You can download it at your convenience.
About Dataset
Overview
This dataset contains detailed records of coffee sales from a vending machine.
The vending machine is the work of a dataset author who is committed to providing an open dataset to the
community.
It is intended for analysis of purchasing patterns, sales trends, and customer preferences related to coffee products.
The dataset spans from March 2024 to the present, capturing daily transaction data, and new records
continue to be added.
Tasks
● Time Series Exploratory Data Analysis
● Next day/week/month sales
● Specific customer purchases
Author
NOTE:
1. This project is for guidance only; you do not have to reproduce it exactly. It shows the steps you can
follow and what a finished project may look like. Some projects are very advanced (built with Flask, NLP,
advanced AI, or deep learning) and may be hard to follow at first.
2. You can analyze the data on your own, with your own ideas, and make the project more creative so that it
yields useful information about the business. Make sure you fully understand everything you have
created.
1. Data Collection
First, ensure you have the necessary libraries installed:
pip install pandas scikit-learn matplotlib seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Assume the dataset has the following columns: Date, Store, Product, Sales,
Quantity, Price.
Removing Outliers
import numpy as np
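The outlier-removal code itself is not shown in this guide. One common approach is the interquartile range (IQR) rule. The sketch below uses hypothetical values in the assumed `Sales` column, and the 1.5 multiplier is a conventional choice rather than something the guide specifies.

```python
import pandas as pd

# Hypothetical values for the assumed 'Sales' column; 500.0 is a planted outlier.
data = pd.DataFrame({"Sales": [28.9, 32.8, 33.8, 37.7, 38.7, 40.0, 500.0]})

# Interquartile range of the column
q1 = data["Sales"].quantile(0.25)
q3 = data["Sales"].quantile(0.75)
iqr = q3 - q1

# Keep only rows that fall within 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = data[data["Sales"].between(lower, upper)]

print(f"Removed {len(data) - len(cleaned)} outlier row(s)")
```

Whether a flagged value is an error or a genuine large sale is a judgment call, so it is worth inspecting the removed rows before discarding them.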
# Sales by store
plt.figure(figsize=(10, 6))
sns.barplot(data=data, x='Store', y='Sales')
plt.title('Sales by Store')
plt.show()
# Sales by product
plt.figure(figsize=(10, 6))
sns.barplot(data=data, x='Product', y='Sales')
plt.title('Sales by Product')
plt.show()
# Make predictions
y_pred = model.predict(X_test)
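The guide does not show how `model` and `X_test` are created. The sketch below is one possible setup for a next-day sales forecast, assuming a linear regression on the previous day's sales and the day of the week. The data is synthetic and the feature names `lag_1` and `dow` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic daily totals standing in for real sales; values are illustrative only.
rng = np.random.default_rng(0)
dates = pd.date_range("2024-03-01", periods=60, freq="D")
daily = pd.DataFrame({
    "Date": dates,
    "Sales": 200 + 2 * np.arange(60) + rng.normal(0, 5, 60),
})

# Features: yesterday's sales and the day of the week (0 = Monday)
daily["lag_1"] = daily["Sales"].shift(1)
daily["dow"] = daily["Date"].dt.dayofweek
daily = daily.dropna()

X = daily[["lag_1", "dow"]]
y = daily["Sales"]

# Chronological split: train on the first 80%, test on the most recent 20%
split = int(len(daily) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
```

The split is chronological rather than random so that the model is never trained on days that come after the ones it is tested on.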
Summary
This is a basic example. For a more robust analysis, you might consider advanced
techniques like cross-validation, feature selection, and trying different algorithms.
Sample code
Objective
This dataset contains detailed records of coffee sales from a vending machine. The dataset spans from
March 2024 to the present, capturing daily transaction data. In this notebook, we use EDA to
discover customers' purchasing patterns and sales trends, which can aid inventory planning.
Import packages
In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/coffee-sales/index.csv
Load data
In [2]:
coffee_data = pd.read_csv('/kaggle/input/coffee-sales/index.csv')
EDA
In [3]:
coffee_data.head()
Out[3]:
   date        datetime                 cash_type  card                 money  coffee_name
0  2024-03-01  2024-03-01 10:15:50.520  card       ANON-0000-0000-0001  38.7   Latte
1  2024-03-01  2024-03-01 12:19:22.539  card       ANON-0000-0000-0002  38.7   Hot Chocolate
2  2024-03-01  2024-03-01 12:20:18.089  card       ANON-0000-0000-0002  38.7   Hot Chocolate
3  2024-03-01  2024-03-01 13:46:33.006  card       ANON-0000-0000-0003  28.9   Americano
4  2024-03-01  2024-03-01 13:48:14.626  card       ANON-0000-0000-0004  38.7   Latte
In [4]:
coffee_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1133 entries, 0 to 1132
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 1133 non-null object
1 datetime 1133 non-null object
2 cash_type 1133 non-null object
3 card 1044 non-null object
4 money 1133 non-null float64
5 coffee_name 1133 non-null object
dtypes: float64(1), object(5)
memory usage: 53.2+ KB
In [5]:
coffee_data.isnull().sum()
Out[5]:
date 0
datetime 0
cash_type 0
card 89
money 0
coffee_name 0
dtype: int64
In [6]:
coffee_data.duplicated().sum()
Out[6]:
In [7]:
coffee_data.describe().T
Out[7]:
       count   mean       std       min    25%   50%    75%    max
money  1133.0  33.105808  5.035366  18.12  28.9  32.82  37.72  40.0
In [8]:
coffee_data.loc[:,['cash_type','card','coffee_name']].describe().T
Out[8]:
             count  unique  top                  freq
cash_type    1133   2       card                 1044
coffee_name  1133   8       Americano with Milk  268
In [9]:
coffee_data[coffee_data['card'].isnull()]['cash_type'].value_counts()
Out[9]:
cash_type
cash 89
All of the transactions with null 'card' information are from cash users.
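Because cash purchases carry no card identifier, any per-customer analysis (the "Specific customer purchases" task) has to be limited to card transactions. The sketch below shows one way to summarize purchases per card. The rows are toy data shaped like `index.csv`, and the summary column names `purchases`, `total_spent`, and `favourite` are hypothetical.

```python
import pandas as pd

# Toy rows mirroring the columns of index.csv; the values are illustrative only.
sample = pd.DataFrame({
    "cash_type": ["card", "card", "cash", "card", "card"],
    "card": ["ANON-0000-0000-0001", "ANON-0000-0000-0002", None,
             "ANON-0000-0000-0002", "ANON-0000-0000-0002"],
    "money": [38.7, 38.7, 40.0, 38.7, 28.9],
    "coffee_name": ["Latte", "Hot Chocolate", "Latte", "Hot Chocolate", "Americano"],
})

# Keep only rows with a card identifier; cash rows cannot be tied to a customer.
card_sales = sample.dropna(subset=["card"])

# Per-customer purchase count, total spend, and most frequently bought product
customer_summary = card_sales.groupby("card").agg(
    purchases=("money", "size"),
    total_spent=("money", "sum"),
    favourite=("coffee_name", lambda s: s.mode().iloc[0]),
).sort_values("purchases", ascending=False)

print(customer_summary)
```

When two products tie for most frequent, `mode()` returns them in sorted order, so `favourite` picks the alphabetically first one.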
In [10]:
coffee_data['cash_type'].hist()
Out[10]:
<Axes: >
In [11]:
coffee_data['cash_type'].value_counts(normalize=True)
Out[11]:
cash_type
card 0.921447
cash 0.078553
In [12]:
pd.DataFrame(
    coffee_data['coffee_name'].value_counts(normalize=True)
    .sort_values(ascending=False).round(4) * 100
)
Out[12]:
                     proportion
coffee_name
Americano with Milk  23.65
Latte                21.45
Cappuccino           17.30
Americano            14.92
Cortado              8.74
Espresso             4.32
Cocoa                3.09
Americano with Milk and Latte are our most popular coffee products. In the second tier are Cappuccino
and Americano, while Cortado, Hot Chocolate, Espresso, and Cocoa are less popular.
In [13]:
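The code for this cell did not survive the export. Judging from the `info()` output that follows, it converted `date` and `datetime` to datetime types and added `month`, `day`, and `hour` columns. The sketch below is consistent with that output, assuming `strftime` was used (which would explain the `object` dtype of the new columns) and that `day` is the `%w` weekday number, where 0 is Sunday.

```python
import pandas as pd

# Two rows copied from the head() output above
coffee_data = pd.DataFrame({
    "date": ["2024-03-01", "2024-03-01"],
    "datetime": ["2024-03-01 10:15:50.520", "2024-03-01 12:19:22.539"],
})

# Parse the text columns into datetime64 values
coffee_data["date"] = pd.to_datetime(coffee_data["date"])
coffee_data["datetime"] = pd.to_datetime(coffee_data["datetime"])

# Derived string columns used by the later group-by cells
coffee_data["month"] = coffee_data["date"].dt.strftime("%Y-%m")  # e.g. '2024-03'
coffee_data["day"] = coffee_data["date"].dt.strftime("%w")       # '0' = Sunday
coffee_data["hour"] = coffee_data["datetime"].dt.strftime("%H")  # e.g. '10'

print(coffee_data.dtypes)
```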
In [14]:
coffee_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1133 entries, 0 to 1132
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 1133 non-null datetime64[ns]
1 datetime 1133 non-null datetime64[ns]
2 cash_type 1133 non-null object
3 card 1044 non-null object
4 money 1133 non-null float64
5 coffee_name 1133 non-null object
6 month 1133 non-null object
7 day 1133 non-null object
8 hour 1133 non-null object
dtypes: datetime64[ns](2), float64(1), object(6)
memory usage: 79.8+ KB
In [15]:
coffee_data.head()
Out[15]:
date  datetime  cash_type  card  money  coffee_name  month  day  hour
In [16]:
[coffee_data['date'].min(),coffee_data['date'].max()]
Out[16]:
In [17]:
# Select 'money' before summing so that only revenue is aggregated
revenue_data = (
    coffee_data.groupby('coffee_name')['money'].sum()
    .reset_index()
    .sort_values(by='money', ascending=False)
)
In [18]:
plt.figure(figsize=(10,4))
ax = sns.barplot(data=revenue_data,x='money',y='coffee_name',color='steelblue')
ax.bar_label(ax.containers[0], fontsize=6)
plt.xlabel('Revenue')
Out[18]:
Text(0.5, 0, 'Revenue')
Latte is the product with the highest revenue, while Espresso has the lowest. Next, let's check
the monthly data.
In [19]:
monthly_sales = (
    coffee_data.groupby(['coffee_name', 'month']).count()['date']
    .reset_index()
    .rename(columns={'date': 'count'})
    .pivot(index='month', columns='coffee_name', values='count')
    .reset_index()
)
monthly_sales
Out[19]:
coffee_name  month    Americano  Americano with Milk  Cappuccino  Cocoa  Cortado  Espresso  Hot Chocolate  Latte
0            2024-03  36         34                   20          6      30       10        22             48
1            2024-04  35         42                   43          6      19       7         13             31
2            2024-05  48         58                   55          9      17       8         14             58
3            2024-06  14         69                   46          5      19       10        14             50
4            2024-07  36         65                   32          9      14       14        11             56
In [20]:
monthly_sales.describe().T.loc[:,['min','max']]
Out[20]:
                     min   max
coffee_name
Americano with Milk  34.0  69.0
In [21]:
plt.figure(figsize=(12,6))
sns.lineplot(data=monthly_sales)
plt.legend(loc='upper left')
plt.xticks(range(len(monthly_sales['month'])),monthly_sales['month'],size='small')
Out[21]:
[Figure: line chart of monthly sales counts per coffee type, with x-axis ticks 2024-03 through 2024-07]
As shown in the line chart above, Americano with Milk, Latte, and Cappuccino are the top-selling coffee
types, while Cocoa and Espresso have the lowest sales. Additionally, Americano with Milk and Latte show an
upward trend.
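The upward trend noted above can be checked roughly by fitting a straight line to each product's monthly counts. The sketch below uses the counts from the monthly table and `np.polyfit`. With only five data points the slope is an indication, not a statistical test.

```python
import numpy as np

# Monthly counts (2024-03 to 2024-07) copied from the monthly table above
monthly_counts = {
    "Americano with Milk": [34, 42, 58, 69, 65],
    "Latte": [48, 31, 58, 50, 56],
}

# Month index 0..4 as the x-axis; the first polyfit coefficient is the slope
months = np.arange(5)
slopes = {
    name: np.polyfit(months, counts, 1)[0]
    for name, counts in monthly_counts.items()
}

for name, slope in slopes.items():
    print(f"{name}: {slope:+.1f} sales per month")
```

Both slopes come out positive (about 8.9 for Americano with Milk and 3.5 for Latte), but the Latte series is noisy, with a dip in April, so its trend is the weaker of the two.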
In [22]:
weekday_sales = (
    coffee_data.groupby(['day']).count()['date']
    .reset_index().rename(columns={'date': 'count'})
)
weekday_sales
Out[22]:
   day  count
0 0 151
1 1 151
2 2 185
3 3 165
4 4 164
5 5 163
6 6 154
In [23]:
plt.figure(figsize=(12,6))
sns.barplot(data=weekday_sales,x='day',y='count',color='steelblue')
plt.xticks(range(len(weekday_sales['day'])),
           ['Sun', 'Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat'], size='small')
Out[23]:
[Figure: bar chart of sales count by day of the week, with x-axis ticks Sun through Sat]
The bar chart reveals that Tuesday has the highest sales of the week, while sales on the other days are
relatively similar.
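To put a number on how far Tuesday stands out, each day's count can be compared with the weekly average. The sketch below rebuilds the weekday table from the counts shown above. The column name `pct_vs_mean` is hypothetical, and the day labels follow the same Sunday-first mapping the chart uses.

```python
import pandas as pd

# Counts copied from the weekday table above, labelled Sunday-first as in the chart
weekday_sales = pd.DataFrame({
    "day": ["Sun", "Mon", "Tue", "Wed", "Thur", "Fri", "Sat"],
    "count": [151, 151, 185, 165, 164, 163, 154],
})

# Percentage difference of each day from the average daily-of-week count
mean_count = weekday_sales["count"].mean()
weekday_sales["pct_vs_mean"] = (
    (weekday_sales["count"] / mean_count - 1) * 100
).round(1)

busiest = weekday_sales.loc[weekday_sales["count"].idxmax(), "day"]
print(weekday_sales)
print("Busiest day:", busiest)
```

On these counts Tuesday sits roughly 14% above the average, while every other day is within about 7% of it.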
In [24]:
daily_sales = (
    coffee_data.groupby(['coffee_name', 'date']).count()['datetime']
    .reset_index()
    .rename(columns={'datetime': 'count'})
    .pivot(index='date', columns='coffee_name', values='count')
    .reset_index().fillna(0)
)
daily_sales
Out[24]:
coffee_name  date        Americano  Americano with Milk  Cappuccino  Cocoa  Cortado  Espresso  Hot Chocolate  Latte
0            2024-03-01  1.0        4.0                  0.0         1.0    0.0      0.0       3.0            2.0
1            2024-03-02  3.0        3.0                  0.0         0.0    0.0      0.0       0.0            1.0
2            2024-03-03  1.0        2.0                  0.0         1.0    2.0      0.0       2.0            2.0
3            2024-03-04  0.0        1.0                  0.0         0.0    0.0      1.0       0.0            2.0
4            2024-03-05  0.0        0.0                  0.0         1.0    1.0      0.0       4.0            3.0
...          ...         ...        ...                  ...         ...    ...      ...       ...            ...
145          2024-07-27  0.0        5.0                  4.0         0.0    0.0      2.0       0.0            2.0
146          2024-07-28  0.0        1.0                  0.0         0.0    0.0      1.0       0.0            1.0
147          2024-07-29  3.0        2.0                  2.0         1.0    0.0      0.0       2.0            1.0
148          2024-07-30  2.0        12.0                 2.0         0.0    3.0      2.0       0.0            3.0
149          2024-07-31  2.0        6.0                  1.0         2.0    4.0      0.0       0.0            7.0
In [25]:
daily_sales.iloc[:,1:].describe().T.loc[:,['min','max']]
Out[25]:
             min  max
coffee_name
This table shows the minimum and maximum number of units of each product sold in a single day.
In [26]:
hourly_sales = (
    coffee_data.groupby(['hour']).count()['date']
    .reset_index().rename(columns={'date': 'count'})
)
hourly_sales
Out[26]:
    hour  count
0 07 13
1 08 44
2 09 50
3 10 133
4 11 103
5 12 87
6 13 78
7 14 76
8 15 65
9 16 77
10 17 77
11 18 75
12 19 96
13 20 54
14 21 70
15 22 35
In [27]:
sns.barplot(data=hourly_sales,x='hour',y='count',color='steelblue')
Out[27]:
In [28]:
hourly_sales_by_coffee = (
    coffee_data.groupby(['hour', 'coffee_name']).count()['date']
    .reset_index()
    .rename(columns={'date': 'count'})
    .pivot(index='hour', columns='coffee_name', values='count')
    .fillna(0).reset_index()
)
hourly_sales_by_coffee
Out[28]:
In [29]:
# One subplot per coffee type; the 2x4 grid is an assumption that fits the 8 products
fig, axs = plt.subplots(2, 4, figsize=(20, 10))
axs = axs.flatten()
# Loop through each column in the DataFrame, skipping the first column ('hour')
for i, column in enumerate(hourly_sales_by_coffee.columns[1:]):
    axs[i].bar(hourly_sales_by_coffee['hour'], hourly_sales_by_coffee[column])
    axs[i].set_title(f'{column}')
    axs[i].set_xlabel('Hour')
    #axs[i].set_ylabel('Sales')
plt.tight_layout()
Conclusion
From the analysis above, we have uncovered valuable insights into customer shopping patterns on a
daily and weekly basis. We have identified the most popular coffee products and observed the shopping
trends over time. These findings are instrumental in optimizing inventory planning, designing the layout
of vending machines, and determining the ideal restock times for coffee products.