Python For Business Decision Making Asm2

The document discusses analyzing sales data from burger stores between January 2014 to September 2015. It cleans the data by changing data types, filtering dates, and checking for missing values, duplicates and outliers. It then calculates descriptive statistics on the monthly aggregated data, including the mean price, total quantity and total sales by month. No outliers were found in the data.

Python_for_Business_Decision_Making_Asm2

May 3, 2023

Date: date of the transactions (grouped by day)
Price: unit price
Qty: quantity of products sold
Item: name of the item
Holiday: holiday flag (0 = non-holiday, 1 = holiday)
Is Weekend: weekend flag (0 = weekday, 1 = weekend)
Is Schoolbreak: school-break flag (0 = non-schoolbreak, 1 = schoolbreak)
total_sales: total sales value for the day
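The preview rows later in the notebook suggest that total_sales is simply price multiplied by qty. A quick sanity check on a few rows copied from that preview (an assumption: the full file follows the same rule):

```python
import pandas as pd

# Rows copied from the dataset preview shown later in the notebook (assumed
# to be representative of the full file).
sample = pd.DataFrame({
    "price": [15.5, 15.5, 14.5],
    "qty": [72, 76, 90],
    "total_sales": [1116, 1178, 1305],
})

# The preview suggests total_sales = price * qty; verify on the sample rows
assert (sample["price"] * sample["qty"] == sample["total_sales"]).all()
print("total_sales equals price * qty on every sample row")
```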

[ ]: from google.colab import drive
stores = drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call
drive.mount("/content/drive", force_remount=True).

[ ]: import pandas as pd
from scipy import stats
import warnings
warnings.filterwarnings('ignore')


[ ]: stores = pd.read_csv('/content/drive/MyDrive/Python for Da/Burger_store.csv')
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 608 non-null object
1 price 608 non-null float64
2 qty 608 non-null int64
3 item 608 non-null object
4 holiday 608 non-null int64
5 is_weekend 608 non-null int64
6 is_schoolbreak 608 non-null int64
7 total_sales 608 non-null int64

dtypes: float64(1), int64(5), object(2)
memory usage: 38.1+ KB

[ ]: # change 'date' from object type to datetime
stores['date'] = pd.to_datetime(stores['date'])
stores.info()
stores

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 608 non-null datetime64[ns]
1 price 608 non-null float64
2 qty 608 non-null int64
3 item 608 non-null object
4 holiday 608 non-null int64
5 is_weekend 608 non-null int64
6 is_schoolbreak 608 non-null int64
7 total_sales 608 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 38.1+ KB

[ ]: date price qty item holiday is_weekend is_schoolbreak \


0 2014-01-01 15.5 72 BURGER 1 0 0
1 2014-01-02 15.5 76 BURGER 1 0 0
2 2014-01-03 15.5 68 BURGER 1 0 0
3 2014-01-04 15.5 74 BURGER 0 1 0
4 2014-01-05 15.5 70 BURGER 0 1 0
.. … … … … … … …
603 2015-08-27 14.5 92 BURGER 0 0 1
604 2015-08-28 14.5 90 BURGER 0 0 1
605 2015-08-29 14.5 68 BURGER 0 1 1
606 2015-08-30 14.5 64 BURGER 0 1 1
607 2015-08-31 14.5 90 BURGER 0 0 1

total_sales
0 1116
1 1178
2 1054
3 1147
4 1085
.. …
603 1334
604 1305
605 986

606 928
607 1305

[608 rows x 8 columns]

[ ]: # the date column is already datetime; re-converting is a harmless no-op
stores['date'] = pd.to_datetime(stores['date'])

# create a new column with the month and year information
stores['month_year'] = stores['date'].dt.strftime('%B %Y')
stores.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 608 non-null datetime64[ns]
1 price 608 non-null float64
2 qty 608 non-null int64
3 item 608 non-null object
4 holiday 608 non-null int64
5 is_weekend 608 non-null int64
6 is_schoolbreak 608 non-null int64
7 total_sales 608 non-null int64
8 month_year 608 non-null object
dtypes: datetime64[ns](1), float64(1), int64(5), object(2)
memory usage: 42.9+ KB

[ ]: # convert the data type of the month_year column to datetime
stores['month_year'] = pd.to_datetime(stores['month_year'], format='%B %Y')

# filter the records from January 2014 to September 2015
start_date = pd.to_datetime('2014-01-01')
end_date = pd.to_datetime('2015-09-30')
mask = (stores['month_year'] >= start_date) & (stores['month_year'] <= end_date)
stores = stores.loc[mask]

# group by month and year and calculate the mean price, total quantity,
# and total sales
monthly_price = stores.groupby('month_year')['price'].mean().round(2)
monthly_qty = stores.groupby('month_year')['qty'].sum()
monthly_total_sales = stores.groupby('month_year')['total_sales'].sum()

# combine the results into a single DataFrame
monthly_summary = pd.concat([monthly_price, monthly_qty, monthly_total_sales], axis=1)

# display the result
print(monthly_summary)

price qty total_sales


month_year
2014-01-01 15.50 2780 43090
2014-02-01 15.50 2268 35154
2014-03-01 15.50 2390 37045
2014-04-01 15.13 2348 35498
2014-05-01 14.50 2604 37758
2014-06-01 14.50 2480 35960
2014-07-01 14.50 2666 38657
2014-08-01 14.95 2528 37766
2014-09-01 15.50 2334 36177
2014-10-01 15.50 2348 36394
2014-11-01 15.50 2302 35681
2014-12-01 14.73 2884 42197
2015-01-01 14.00 2910 40740
2015-02-01 14.00 2492 34888
2015-03-01 14.00 2802 39228
2015-04-01 15.07 2394 35948
2015-05-01 16.00 2194 35104
2015-06-01 16.00 2228 35648
2015-07-01 16.00 2374 37984
2015-08-01 15.08 2518 37864
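The strftime round-trip above (datetime, to a 'January 2014' string, back to datetime) works, but pandas can produce a monthly key directly with dt.to_period('M'). A sketch on toy data (the real stores frame is not reused here):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-05", "2014-01-20", "2014-02-03"]),
    "qty": [70, 72, 76],
})

# to_period('M') collapses each date to its month, e.g. Period('2014-01', 'M')
df["month_year"] = df["date"].dt.to_period("M")
monthly_qty = df.groupby("month_year")["qty"].sum()
print(monthly_qty.tolist())  # → [142, 76]
```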

0.1 1.b. Clean data:


[ ]: # Checking for missing data
print("Number of missing values in each column:\n", stores.isnull().sum())

Number of missing values in each column:


date 0
price 0
qty 0
item 0
holiday 0
is_weekend 0
is_schoolbreak 0
total_sales 0
month_year 0
dtype: int64

[ ]: # Checking for duplicate data
print("Number of duplicated records: ", len(stores[stores.duplicated()]))

Number of duplicated records: 0
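No missing values or duplicates were found here; had either appeared, a typical cleaning step would look like this sketch on a small hypothetical frame (not the real Burger_store.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one exact duplicate row and one missing price
df = pd.DataFrame({
    "date": ["2014-01-01", "2014-01-01", "2014-01-02"],
    "price": [15.5, 15.5, np.nan],
    "qty": [72, 72, 76],
})

df = df.drop_duplicates()                                # drop exact duplicates
df["price"] = df["price"].fillna(df["price"].median())   # impute missing price
print(len(df), int(df.isnull().sum().sum()))             # → 2 0
```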

[ ]: # Checking for outlier data


import seaborn as sns
import matplotlib.pyplot as plt

[ ]: # Calculate IQR for 'price' variable


Q1 = stores['price'].quantile(0.25)
Q3 = stores['price'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers
price_outliers = (stores['price'] < (Q1 - 1.5 * IQR)) | (stores['price'] > (Q3 + 1.5 * IQR))

# Print number of outliers


print("Number of outliers for 'price':", price_outliers.sum())

# Create box plot


sns.boxplot(x=stores['price'])

Number of outliers for 'price': 0

[ ]: <Axes: xlabel='price'>

[ ]: # Calculate IQR for 'qty' variable
Q1 = stores['qty'].quantile(0.25)
Q3 = stores['qty'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers
qty_outliers = (stores['qty'] < (Q1 - 1.5 * IQR)) | (stores['qty'] > (Q3 + 1.5 * IQR))

# Print number of outliers


print("Number of outliers for 'qty':", qty_outliers.sum())

# Create box plot


sns.boxplot(x=stores['qty'])

Number of outliers for 'qty': 0

[ ]: <Axes: xlabel='qty'>

[ ]: # Calculate IQR for 'total_sales' variable
Q1 = stores['total_sales'].quantile(0.25)
Q3 = stores['total_sales'].quantile(0.75)
IQR = Q3 - Q1

# Define outliers
sales_outliers = (stores['total_sales'] < (Q1 - 1.5 * IQR)) | (stores['total_sales'] > (Q3 + 1.5 * IQR))

# Print number of outliers


print("Number of outliers for 'total_sales':", sales_outliers.sum())

# Create box plot


sns.boxplot(x=stores['total_sales'])

Number of outliers for 'total_sales': 0

[ ]: <Axes: xlabel='total_sales'>

[ ]: # calculate sum of all outliers
total_outliers = price_outliers.sum() + qty_outliers.sum() + sales_outliers.sum()

# print sum of all outliers


print("Sum of all outliers: " + str(total_outliers))

Sum of all outliers: 0
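Since every IQR check came back empty, no treatment was needed. If the fences had flagged values, one common remedy is to cap them at the fences with Series.clip; a sketch on a made-up series with one obvious outlier:

```python
import pandas as pd

# Made-up price series with one clear outlier (40.0), to show the treatment
# that would apply had the IQR fences flagged anything in the real columns.
s = pd.Series([14.0, 14.5, 15.5, 15.5, 16.0, 40.0])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(int(((s < lower) | (s > upper)).sum()))  # → 1 outlier flagged
clipped = s.clip(lower, upper)                 # cap flagged values at the fences
print(float(clipped.max()) < 40.0)             # → True
```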

1 1.c. Calculate descriptive statistics:


[ ]: # calculate descriptive statistics
summary_stats = monthly_summary.describe().round(2)

# display the result


print(summary_stats)

price qty total_sales


count 20.00 20.00 20.00
mean 15.07 2492.20 37439.05
std 0.66 217.34 2354.58

min 14.00 2194.00 34888.00
25% 14.50 2344.50 35672.75
50% 15.10 2437.00 36719.50
75% 15.50 2619.50 38152.25
max 16.00 2910.00 43090.00
Univariate analysis: The categorical variables are: holiday is_weekend is_schoolbreak
The continuous variables are: price qty total_sales
[ ]: cat_cols =['holiday', 'is_weekend','is_schoolbreak']
con_cols =['price', 'qty','total_sales']

[ ]: for column in cat_cols:
    print("*, Column: ", column)
    print(len(stores[column].unique()), "unique values")

*, Column: holiday
2 unique values
*, Column: is_weekend
2 unique values
*, Column: is_schoolbreak
2 unique values

[ ]: for column in con_cols:
    print("*, Column: ", column)
    print(len(stores[column].unique()))

*, Column: price
4
*, Column: qty
39
*, Column: total_sales
64
Categorical:
[ ]: import sys
sys.path.append("/content/drive/MyDrive/Python for Da/")
import EDA_funcs

[ ]: from EDA_funcs import *


import scipy
from scipy.stats import chi2_contingency
from scipy.stats import chi2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

[ ]: for cat in cat_cols:
    print('Univariate analysis', cat)
    univariate_analysis_categorical_variable_2(stores, cat)
    print()

[ ]: for con in con_cols:
    print('Univariate analysis', con)
    univariate_analysis_continuous_variable(stores, stores[con])
    check_outlier(stores, stores[con])
    univariate_visualization_analysis_continuous_variable_new(stores[con])
    print()

Univariate analysis price


Describe:
count 608.000000
mean 15.074013
std 0.735843
min 14.000000
25% 14.500000
50% 15.500000
75% 15.500000
max 16.000000
Name: price, dtype: float64
Mode: 0 15.5
Name: price, dtype: float64
Range: 2.0
IQR: 1.0
Var: 0.5414652518858927
Std: 0.73584322507304
Skew: -0.2570967196392115
Kurtosis: -1.4796550712211953

Number of upper outliers: 0
Number of lower outliers: 0
Percentage of outliers: 0.0

Univariate analysis qty
Describe:
count 608.000000
mean 81.980263
std 16.412303
min 38.000000
25% 68.000000
50% 84.000000
75% 92.500000
max 124.000000
Name: qty, dtype: float64
Mode: 0 84
Name: qty, dtype: int64
Range: 86
IQR: 24.5
Var: 269.3636954825284
Std: 16.412303174220504
Skew: -0.13864071557455301
Kurtosis: -0.3265014211143429

Number of upper outliers: 0
Number of lower outliers: 0
Percentage of outliers: 0.0

Univariate analysis total_sales
Describe:
count 608.000000
mean 1231.547697
std 230.822548
min 589.000000
25% 986.000000
50% 1312.000000
75% 1372.000000
max 1736.000000
Name: total_sales, dtype: float64
Mode: 0 1344
Name: total_sales, dtype: int64
Range: 1147
IQR: 386.0
Var: 53279.04879205324
Std: 230.82254827475856
Skew: -0.409031487896684
Kurtosis: -0.5073796235684283

Number of upper outliers: 0
Number of lower outliers: 0
Percentage of outliers: 0.0

Bi-variable analysis (total_sales with others).
Continuous - Continuous
[ ]: for i in range(0, len(con_cols)):
    col1 = con_cols[i]
    col2 = 'total_sales'
    print('Bi-variable analysis', col1, 'and', col2)
    print(stores[[col1, col2]].corr())
    print()

Bi-variable analysis price and total_sales


price total_sales
price 1.000000 -0.108335
total_sales -0.108335 1.000000

Bi-variable analysis qty and total_sales


qty total_sales
qty 1.00000 0.96776
total_sales 0.96776 1.00000

Bi-variable analysis total_sales and total_sales


total_sales total_sales
total_sales 1.0 1.0
total_sales 1.0 1.0

[ ]: sns.pairplot(stores[["total_sales", "price"]])

[ ]: <seaborn.axisgrid.PairGrid at 0x7feb29120fd0>

[ ]: sns.pairplot(stores[["total_sales", "qty"]])

[ ]: <seaborn.axisgrid.PairGrid at 0x7feb28f55450>

Two-variable analysis price and total_sales:
The correlation coefficient between price and total_sales is -0.108, a weak negative
correlation: as price increases, total_sales tends to decrease slightly. With 608
observations even a correlation of this size can be statistically detectable, but
price alone explains very little of the variation in total_sales.
Two-variable analysis qty and total_sales:
The correlation coefficient between qty and total_sales is 0.968, a very strong
positive correlation: as qty increases, total_sales increases almost proportionally.
This is expected, since total_sales is essentially price multiplied by qty, so qty
is by construction a strong predictor of total_sales.
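To put a p-value on a correlation, scipy.stats.pearsonr returns both r and the two-sided p-value. This sketch uses synthetic stand-ins for the two columns (assumed distributions, for API illustration only; on the real data pass stores['price'] and stores['total_sales']):

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for stores['price'] and stores['total_sales'] (assumed
# price levels and effect size; replace with the real columns in practice).
rng = np.random.default_rng(0)
n = 608
price = rng.choice([14.0, 14.5, 15.5, 16.0], size=n)
total_sales = 1230 - 35 * (price - price.mean()) + rng.normal(0, 230, size=n)

r, p = pearsonr(price, total_sales)
print(f"r = {r:.3f}, p = {p:.4g}")
```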
Categorical - Continuous
[ ]: cat_cols

[ ]: ['holiday', 'is_weekend', 'is_schoolbreak']

[ ]: # ANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols

[ ]: d_melt = stores[['holiday', 'is_weekend', 'is_schoolbreak', 'total_sales']]
d_melt.head()

[ ]: holiday is_weekend is_schoolbreak total_sales


0 1 0 0 1116
1 1 0 0 1178
2 1 0 0 1054
3 0 1 0 1147
4 0 1 0 1085

[ ]: # create the linear regression model (statsmodels was imported above)
model = ols('total_sales ~ holiday + is_weekend + is_schoolbreak', data=stores).fit()

# perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# print the ANOVA table
print(anova_table)

sum_sq df F PR(>F)
holiday 5.146908e+06 1.0 549.695410 6.209636e-87
is_weekend 2.129255e+07 1.0 2274.067488 6.121558e-207
is_schoolbreak 1.631602e+04 1.0 1.742568 1.873138e-01
Residual 5.655373e+06 604.0 NaN NaN
The ANOVA table shows the results of a linear regression model that investigates the relationship
between total sales and three predictor variables: holiday, is_weekend, and is_schoolbreak.
The table shows that both holiday and is_weekend have a significant effect on total sales,
with very low p-values (6.21e-87 and 6.12e-207, respectively). However, is_schoolbreak does not
have a significant effect on total sales, with a relatively high p-value (1.87e-01).
Overall, these results suggest that holiday and is_weekend are strong predictors of total sales, while
is_schoolbreak does not have a significant effect.
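One way to read the two significant effects is to compare group means of total_sales under each flag: the F-tests say the group means differ, and the means show by how much and in which direction. A sketch on simulated data (effect sizes and directions are assumed for illustration only; on the real data use stores.groupby('holiday')['total_sales'].mean() and so on):

```python
import numpy as np
import pandas as pd

# Simulated flags and sales (assumed effects: holidays lift sales, weekends
# lower them; replace df with the real stores frame in practice).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "holiday": rng.integers(0, 2, n),
    "is_weekend": rng.integers(0, 2, n),
})
df["total_sales"] = (1200 + 150 * df["holiday"] - 300 * df["is_weekend"]
                     + rng.normal(0, 100, n))

# Mean total_sales per flag level shows each effect's size and direction
print(df.groupby("holiday")["total_sales"].mean().round(1))
print(df.groupby("is_weekend")["total_sales"].mean().round(1))
```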
