Python For Business Decision Making Asm2
May 3, 2023
Date: Date of the transactions (grouped by day)
Price: Unit price
Qty: Quantity of products
Item: Name of the item
Holiday: Flag of holiday (0 = non-holiday, 1 = holiday)
Is Weekend: Flag of weekend (0 = weekday, 1 = weekend)
Is Schoolbreak: Flag of school break (0 = non-school break, 1 = school break)
total_sales
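[ ]: import pandas as pd
import seaborn as sns  # used for the pair plots further down
from scipy import stats
import warnings
from google.colab import drive  # required by drive.mount (Colab only)
warnings.filterwarnings('ignore')
drive.mount('/content/drive')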
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 608 non-null object
1 price 608 non-null float64
2 qty 608 non-null int64
3 item 608 non-null object
4 holiday 608 non-null int64
5 is_weekend 608 non-null int64
6 is_schoolbreak 608 non-null int64
7 total_sales 608 non-null int64
dtypes: float64(1), int64(5), object(2)
memory usage: 38.1+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 608 non-null datetime64[ns]
1 price 608 non-null float64
2 qty 608 non-null int64
3 item 608 non-null object
4 holiday 608 non-null int64
5 is_weekend 608 non-null int64
6 is_schoolbreak 608 non-null int64
7 total_sales 608 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 38.1+ KB
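Between the two summaries above, the date column changes from object to datetime64[ns]. The cell that performed the conversion is not shown. The following is a minimal sketch, assuming the conversion was done with pd.to_datetime; the sample frame and its values are hypothetical and only stand in for the real data.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the notebook's own file.
sample = pd.DataFrame({"date": ["2023-01-01", "2023-01-02"], "qty": [36, 38]})

# Parse the text column into pandas datetimes so that date arithmetic
# and the .dt accessor become available.
sample["date"] = pd.to_datetime(sample["date"])
print(sample.dtypes)
```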
total_sales
0 1116
1 1178
2 1054
3 1147
4 1085
.. …
603 1334
604 1305
605 986
606 928
607 1305
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 608 non-null datetime64[ns]
1 price 608 non-null float64
2 qty 608 non-null int64
3 item 608 non-null object
4 holiday 608 non-null int64
5 is_weekend 608 non-null int64
6 is_schoolbreak 608 non-null int64
7 total_sales 608 non-null int64
8 month_year 608 non-null object
dtypes: datetime64[ns](1), float64(1), int64(5), object(2)
memory usage: 42.9+ KB
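The summary above shows a new month_year column of dtype object, which suggests a text label derived from date. The cell that created it is not shown. One possible way to build such a column is sketched here; the "%Y-%m" format and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical sample rows standing in for the notebook's `stores` frame.
sample = pd.DataFrame({"date": pd.to_datetime(["2023-01-15", "2023-02-03"])})

# Year-month label stored as text, e.g. "2023-01"; the format string is an assumption.
sample["month_year"] = sample["date"].dt.strftime("%Y-%m")
print(sample)
```

An alternative is sample["date"].dt.to_period("M"), which yields a period dtype rather than text.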
[ ]: # group by month and year and calculate the mean price, total quantity, and total sales
monthly_price = stores.groupby('month_year')['price'].mean().round(2)
monthly_qty = stores.groupby('month_year')['qty'].sum()
monthly_total_sales = stores.groupby('month_year')['total_sales'].sum()
monthly_summary = pd.concat([monthly_price, monthly_qty, monthly_total_sales], axis=1)
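The monthly summary can also be built with a single groupby and named aggregations. This is a sketch on hypothetical toy data that only reuses the notebook's column names; it is an alternative to the three groupbys plus concat, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's `stores`.
toy = pd.DataFrame({
    "month_year": ["2023-01", "2023-01", "2023-02"],
    "price": [14.0, 15.0, 16.0],
    "qty": [10, 20, 30],
    "total_sales": [140, 300, 480],
})

# One pass over the groups: mean price, summed quantity, summed sales.
toy_summary = toy.groupby("month_year").agg(
    price=("price", "mean"),
    qty=("qty", "sum"),
    total_sales=("total_sales", "sum"),
).round(2)
print(toy_summary)
```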
[ ]: # Checking for duplicate data
print("Number of duplicated records: ", len(stores[stores.duplicated()]))
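[ ]: # Calculate IQR for 'price' variable
Q1 = stores['price'].quantile(0.25)
Q3 = stores['price'].quantile(0.75)
IQR = Q3 - Q1
# Define outliers
price_outliers = (stores['price'] < (Q1 - 1.5 * IQR)) | (stores['price'] > (Q3 + 1.5 * IQR))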
[ ]: <Axes: xlabel='price'>
[ ]: # Calculate IQR for 'qty' variable
Q1 = stores['qty'].quantile(0.25)
Q3 = stores['qty'].quantile(0.75)
IQR = Q3 - Q1
# Define outliers
qty_outliers = (stores['qty'] < (Q1 - 1.5 * IQR)) | (stores['qty'] > (Q3 + 1.5 * IQR))
[ ]: <Axes: xlabel='qty'>
[ ]: # Calculate IQR for 'total_sales' variable
Q1 = stores['total_sales'].quantile(0.25)
Q3 = stores['total_sales'].quantile(0.75)
IQR = Q3 - Q1
# Define outliers
sales_outliers = (stores['total_sales'] < (Q1 - 1.5 * IQR)) | (stores['total_sales'] > (Q3 + 1.5 * IQR))
[ ]: <Axes: xlabel='total_sales'>
[ ]: # calculate sum of all outliers
total_outliers = price_outliers.sum() + qty_outliers.sum() + sales_outliers.sum()
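The same 1.5 * IQR rule is applied to each of the three columns, so it can be wrapped in one helper. This is a sketch: the function name iqr_outlier_mask and the sample values are hypothetical and do not come from the notebook.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical values: 100 lies far above the rest.
values = pd.Series([10, 11, 12, 13, 14, 100])
mask = iqr_outlier_mask(values)
print(mask.sum())
```

Adding the per-column counts counts a row once for every column in which it is flagged. Combining the masks with | before calling sum() counts each row at most once.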
        price      qty  total_sales
min     14.00  2194.00     34888.00
25%     14.50  2344.50     35672.75
50%     15.10  2437.00     36719.50
75%     15.50  2619.50     38152.25
max     16.00  2910.00     43090.00
Univariate analysis
The categorical variables are: holiday, is_weekend, is_schoolbreak.
The continuous variables are: price, qty, total_sales.
[ ]: cat_cols = ['holiday', 'is_weekend', 'is_schoolbreak']
con_cols = ['price', 'qty', 'total_sales']
*, Column: holiday
2 unique values
*, Column: is_weekend
2 unique values
*, Column: is_schoolbreak
2 unique values
*, Column: price
4
*, Column: qty
39
*, Column: total_sales
64
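The cell that printed these unique-value counts is not shown. Counts of this kind can be obtained with DataFrame.nunique, as in the following sketch on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    "holiday": [0, 1, 0, 0],
    "price": [14.0, 14.5, 14.0, 16.0],
})

# Number of distinct values per column.
unique_counts = toy.nunique()
print(unique_counts)
```

A small number of distinct values does not by itself make a column categorical: price has only 4 distinct values here but is still treated as continuous.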
Categorical:
[ ]: import sys
sys.path.append("/content/drive/MyDrive/Python for Da/")
import EDA_funcs
[ ]: for cat in cat_cols:
    print('Univariate analysis', cat)
    # the helper lives in the imported EDA_funcs module
    EDA_funcs.univariate_analysis_categorical_variable_2(stores, cat)
    print()
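The EDA_funcs module is not included in this document, so the body of its helper is unknown. As a rough stand-in, a univariate summary of a 0/1 flag usually reports the count and share of each level. The sketch below assumes that and uses hypothetical values.

```python
import pandas as pd

# Hypothetical flag values; the real column comes from `stores`.
flags = pd.Series([0, 0, 1, 0, 1, 0], name="is_weekend")

# Frequency and relative frequency of each level.
counts = flags.value_counts()
shares = flags.value_counts(normalize=True).round(3)
freq_table = pd.DataFrame({"count": counts, "share": shares})
print(freq_table)
```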
Number of upper outliers: 0
Number of lower outliers: 0
Percentage of ouliers: 0.0
Univariate analysis qty
Describe:
count 608.000000
mean 81.980263
std 16.412303
min 38.000000
25% 68.000000
50% 84.000000
75% 92.500000
max 124.000000
Name: qty, dtype: float64
Mode: 0 84
Name: qty, dtype: int64
Range: 86
IQR: 24.5
Var: 269.3636954825284
Std: 16.412303174220504
Skew: -0.13864071557455301
Kurtosis: -0.3265014211143429
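The figures above are internally consistent: Range is max minus min (124 - 38 = 86), IQR is the 75% value minus the 25% value (92.5 - 68 = 24.5), and Var is the square of Std. The helper that printed them is not shown. The following sketch shows how the same quantities can be computed with pandas, on hypothetical values.

```python
import pandas as pd

# Hypothetical values standing in for stores['qty'].
s = pd.Series([38, 68, 84, 84, 92, 124], name="qty")

summary = {
    "range": s.max() - s.min(),
    "iqr": s.quantile(0.75) - s.quantile(0.25),
    "var": s.var(),        # sample variance (ddof=1)
    "std": s.std(),        # sample standard deviation (ddof=1)
    "skew": s.skew(),      # bias-corrected sample skewness
    "kurtosis": s.kurt(),  # bias-corrected excess kurtosis (normal = 0)
    "mode": s.mode().tolist(),
}
for name, value in summary.items():
    print(name, value)
```

pandas reports excess kurtosis, so a negative value such as the -0.33 above indicates tails that are lighter than those of a normal distribution.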
Number of upper outliers: 0
Number of lower outliers: 0
Percentage of ouliers: 0.0
Univariate analysis total_sales
Describe:
count 608.000000
mean 1231.547697
std 230.822548
min 589.000000
25% 986.000000
50% 1312.000000
75% 1372.000000
max 1736.000000
Name: total_sales, dtype: float64
Mode: 0 1344
Name: total_sales, dtype: int64
Range: 1147
IQR: 386.0
Var: 53279.04879205324
Std: 230.82254827475856
Skew: -0.409031487896684
Kurtosis: -0.5073796235684283
Number of upper outliers: 0
Number of lower outliers: 0
Percentage of ouliers: 0.0
Bivariate analysis (total_sales with the other variables).
Continuous - Continuous
[ ]: for col1 in con_cols:
    col2 = 'total_sales'
    print('Bi-variable analysis', col1, 'and', col2)
    print(stores[[col1, col2]].corr())
    print()
[ ]: sns.pairplot(stores[["total_sales", "price"]])
[ ]: <seaborn.axisgrid.PairGrid at 0x7feb29120fd0>
[ ]: sns.pairplot(stores[["total_sales", "qty"]])
[ ]: <seaborn.axisgrid.PairGrid at 0x7feb28f55450>
Two-variable analysis price and total_sales:
The correlation coefficient between price and total_sales is -0.108, a weak negative correlation:
as price increases, total_sales tends to decrease slightly. Price accounts for only about 1% of
the variance in total_sales (r² ≈ 0.012), so it is a poor predictor on its own. A weak correlation
is not automatically insignificant, however: significance also depends on the sample size (608
rows here) and has to be checked with a test.
Two-variable analysis qty and total_sales:
The correlation coefficient between qty and total_sales is 0.968, a strong positive correlation:
as qty increases, total_sales increases as well. qty accounts for about 94% of the variance in
total_sales (r² ≈ 0.937), which makes it a strong predictor. With 608 rows, a correlation of this
size is statistically significant at any conventional level.
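Whether a correlation is statistically significant can be checked with the usual t-test for a Pearson coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below applies it to the two coefficients reported above with n = 608. The helper name pearson_t_test is hypothetical, and the test assumes the coefficients are Pearson correlations, which is the default of DataFrame.corr.

```python
import math

from scipy import stats

n = 608  # number of rows reported in the DataFrame summary


def pearson_t_test(r: float, n: int):
    """Two-sided t-test of H0: rho = 0 for a Pearson correlation r on n pairs."""
    df = n - 2
    t_stat = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    p_value = 2.0 * stats.t.sf(abs(t_stat), df)
    return t_stat, p_value


# Correlation coefficients as reported in the text.
for name, r in [("price", -0.108), ("qty", 0.968)]:
    t_stat, p_value = pearson_t_test(r, n)
    print(f"{name}: r^2 = {r * r:.3f}, t = {t_stat:.2f}, p = {p_value:.3g}")
```

On the raw columns, scipy.stats.pearsonr returns the coefficient and its p-value in one call.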
Categorical - Continuous
[ ]: cat_cols
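[ ]: # ANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols
# fit the linear model; the original cell is not shown, so this formula is an
# assumption inferred from the row names of the ANOVA table below
model = ols('total_sales ~ holiday + is_weekend + is_schoolbreak', data=stores).fit()
# perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)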
                      sum_sq     df            F         PR(>F)
holiday         5.146908e+06    1.0   549.695410   6.209636e-87
is_weekend      2.129255e+07    1.0  2274.067488  6.121558e-207
is_schoolbreak  1.631602e+04    1.0     1.742568   1.873138e-01
Residual        5.655373e+06  604.0          NaN            NaN
The ANOVA table comes from a linear regression model of total sales on three predictor
variables: holiday, is_weekend, and is_schoolbreak. Each predictor uses 1 degree of freedom, and
604 remain for the residual.
Both holiday (F = 549.70, p = 6.21e-87) and is_weekend (F = 2274.07, p = 6.12e-207) have a
significant effect on total sales. is_schoolbreak does not (F = 1.74, p = 0.187, above the usual
0.05 threshold).
Overall, holiday and is_weekend are strong predictors of total sales, with is_weekend accounting
for the largest sum of squares, while is_schoolbreak adds no significant explanatory power once
the other two are in the model.
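The F and PR(>F) columns follow from the sum_sq and df columns: each F is the term's mean square divided by the residual mean square, and the p-value is the upper tail of the F distribution. The sketch below recomputes them from the rounded values printed in the table, so the results agree with the table only up to rounding.

```python
from scipy import stats

# Values copied from the ANOVA table (already rounded in the printout).
sum_sq = {
    "holiday": 5.146908e06,
    "is_weekend": 2.129255e07,
    "is_schoolbreak": 1.631602e04,
}
resid_ss, resid_df = 5.655373e06, 604.0

mse = resid_ss / resid_df  # residual mean square
results = {}
for term, ss in sum_sq.items():
    term_df = 1.0                      # each term has 1 degree of freedom
    f_stat = (ss / term_df) / mse      # mean square of the term over residual mean square
    p_value = stats.f.sf(f_stat, term_df, resid_df)
    results[term] = (f_stat, p_value)
    print(f"{term}: F = {f_stat:.3f}, p = {p_value:.3g}")
```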