Project Intern - Jupyter Notebook

The document analyzes sales data from a Walmart dataset using Python and pandas. It loads the necessary libraries, imports and cleans the data, and checks for duplicates and outliers. Helper functions find outlier rows and count outliers per column; the analysis flags outliers in the unemployment, holiday_flag, weekly_sales and temperature columns, then compares several regression models on weekly sales.


In [4]: import numpy as np
        import pandas as pd

        import matplotlib.pyplot as plt
        %matplotlib inline

        import seaborn as sns
        sns.set()

        import datetime as dt

        from sklearn.preprocessing import StandardScaler
        from sklearn.preprocessing import MinMaxScaler

        from sklearn.model_selection import train_test_split

        from sklearn.linear_model import LinearRegression
        from sklearn.preprocessing import PolynomialFeatures
        from sklearn.pipeline import Pipeline
        from sklearn.linear_model import Ridge
        from sklearn.linear_model import Lasso
        from sklearn.linear_model import ElasticNet
        from sklearn.tree import DecisionTreeRegressor
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.neural_network import MLPRegressor
        from sklearn.svm import SVR
        from sklearn.neighbors import KNeighborsRegressor
        from sklearn.pipeline import make_pipeline

        from sklearn.metrics import mean_squared_error

        import warnings
        warnings.filterwarnings('ignore')

In [5]: import numpy as np
        import pandas as pd

        import os
        for dirname, _, filenames in os.walk('/kaggle/input'):
            for filename in filenames:
                print(os.path.join(dirname, filename))


In [51]: sales = pd.read_csv('Walmart.csv')
         sales.head(3)

Out[51]:
   Store        Date  Weekly_Sales  Holiday_Flag  Temperature  Fuel_Price         CPI  Unemployment
0      1  05-02-2010    1643690.90             0        42.31       2.572  211.096358         8.106
1      1  12-02-2010    1641957.44             1        38.51       2.548  211.242170         8.106
2      1  19-02-2010    1611968.17             0        39.93       2.514  211.289143         8.106

In [52]: sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 6435 non-null int64
1 Date 6435 non-null object
2 Weekly_Sales 6435 non-null float64
3 Holiday_Flag 6435 non-null int64
4 Temperature 6435 non-null float64
5 Fuel_Price 6435 non-null float64
6 CPI 6435 non-null float64
7 Unemployment 6435 non-null float64
dtypes: float64(5), int64(2), object(1)
memory usage: 402.3+ KB

In [53]: sales['Date'] = pd.to_datetime(sales.Date)
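The Date strings in Walmart.csv are day-first (05-02-2010 is 5 February 2010), and a bare pd.to_datetime can parse ambiguous values month-first, as the reordered dates in later outputs suggest. A safer variant, assuming the usual dd-mm-yyyy layout of this Kaggle export, pins the format explicitly:

         # Assumes dd-mm-yyyy strings as shown in sales.head(); adjust if your copy differs.
         sales['Date'] = pd.to_datetime(sales['Date'], format='%d-%m-%Y')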

In [54]: sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6435 entries, 0 to 6434
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Store 6435 non-null int64
1 Date 6435 non-null datetime64[ns]
2 Weekly_Sales 6435 non-null float64
3 Holiday_Flag 6435 non-null int64
4 Temperature 6435 non-null float64
5 Fuel_Price 6435 non-null float64
6 CPI 6435 non-null float64
7 Unemployment 6435 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(2)
memory usage: 402.3 KB

In [55]: sales.columns = [col.lower() for col in sales.columns]


In [56]: sales.columns

Out[56]: Index(['store', 'date', 'weekly_sales', 'holiday_flag', 'temperature',
                'fuel_price', 'cpi', 'unemployment'],
               dtype='object')

In [57]: sales[sales.duplicated()]

Out[57]:
store date weekly_sales holiday_flag temperature fuel_price cpi unemployment

In [58]: def find_outlier_rows(df, col, level='both'):
             # Tukey's rule: flag values beyond 1.5 * IQR outside the quartiles
             iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
             lower_bound = df[col].quantile(0.25) - 1.5 * iqr
             upper_bound = df[col].quantile(0.75) + 1.5 * iqr

             if level == 'lower':
                 return df[df[col] < lower_bound]
             elif level == 'upper':
                 return df[df[col] > upper_bound]
             else:
                 return df[(df[col] > upper_bound) | (df[col] < lower_bound)]
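As a quick sanity check of the 1.5 × IQR fences (a hypothetical toy frame, not part of the original analysis):

         toy = pd.DataFrame({'x': [1, 2, 3, 4, 100]})   # Q1 = 2, Q3 = 4, IQR = 2
         find_outlier_rows(toy, 'x')                    # upper fence is 4 + 1.5*2 = 7, so only the row with 100 is returned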

In [59]: def count_outliers(df):
             df_numeric = df.select_dtypes(include=['int', 'float'])
             columns = df_numeric.columns

             # columns that have at least one IQR outlier
             outlier_cols = [col for col in columns if len(find_outlier_rows(df_numeric, col)) > 0]

             outliers_df = pd.DataFrame(columns=['outlier_counts', 'outlier_percent'])

             for col in outlier_cols:
                 outlier_count = len(find_outlier_rows(df_numeric, col))
                 all_entries = len(df[col])
                 outlier_percent = round(outlier_count * 100 / all_entries, 2)
                 outliers_df.loc[col] = [outlier_count, outlier_percent]

             return outliers_df


In [60]: count_outliers(sales).sort_values('outlier_counts', ascending=False)

Out[60]:
              outlier_counts  outlier_percent
unemployment           481.0             7.47
holiday_flag           450.0             6.99
weekly_sales            34.0             0.53
temperature              3.0             0.05

In [61]: find_outlier_rows(sales, 'unemployment')['unemployment'].describe()

Out[61]: count    481.000000
         mean      11.447480
         std        3.891387
         min        3.879000
         25%       11.627000
         50%       13.503000
         75%       14.021000
         max       14.313000
         Name: unemployment, dtype: float64

In [62]: find_outlier_rows(sales, 'holiday_flag')['holiday_flag'].describe()

Out[62]: count    450.0
         mean       1.0
         std        0.0
         min        1.0
         25%        1.0
         50%        1.0
         75%        1.0
         max        1.0
         Name: holiday_flag, dtype: float64


In [63]: find_outlier_rows(sales, 'weekly_sales')

Out[63]:
      store       date  weekly_sales  holiday_flag  temperature  fuel_price         cpi  unemploy…
189       2 2010-12-24    3436007.68             0        49.97       2.886  211.064660          8…
241       2 2011-12-23    3224369.80             0        46.66       3.112  218.999550          7…
471       4 2010-11-26    2789469.45             1        48.08       2.752  126.669267          7…
474       4 2010-12-17    2740057.14             0        46.57       2.884  126.879484          7…
475       4 2010-12-24    3526713.39             0        43.21       2.887  126.983581          7…
523       4 2011-11-25    3004702.33             1        47.96       3.225  129.836400          5…
526       4 2011-12-16    2771397.17             0        36.44       3.149  129.898065          5…
527       4 2011-12-23    3676388.98             0        35.92       3.103  129.984548          5…
761       6 2010-12-24    2727575.18             0        55.07       2.886  212.916508          7…
1329     10 2010-11-26    2939946.38             1        55.33       3.162  126.669267          9…
1332     10 2010-12-17    2811646.85             0        59.15       3.125  126.879484          9…
1333     10 2010-12-24    3749057.69             0        57.06       3.236  126.983581          9…
1381     10 2011-11-25    2950198.64             1        60.68       3.760  129.836400          7…
1385     10 2011-12-23    3487986.89             0        48.36       3.541  129.984548          7…
1758     13 2010-11-26    2766400.05             1        28.22       2.830  126.669267          7…
1761     13 2010-12-17    2771646.81             0        35.21       2.842  126.879484          7…
1762     13 2010-12-24    3595903.20             0        34.90       2.846  126.983581          7…
1810     13 2011-11-25    2864170.61             1        38.89       3.445  129.836400          6…
1813     13 2011-12-16    2760346.71             0        27.85       3.282  129.898065          6…
1814     13 2011-12-23    3556766.03             0        24.76       3.186  129.984548          6…
1901     14 2010-11-26    2921709.71             1        46.15       3.039  182.783277          8…
1904     14 2010-12-17    2762861.41             0        30.51       3.140  182.517732          8…
1905     14 2010-12-24    3818686.45             0        30.59       3.141  182.544590          8…
1957     14 2011-12-23    3369068.99             0        42.27       3.389  188.929975          8…
2759     20 2010-11-26    2811634.04             1        46.66       3.039  204.962100          7…
2761     20 2010-10-12    2752122.08             0        24.27       3.109  204.687738          7…
2762     20 2010-12-17    2819193.17             0        24.07       3.140  204.632119          7…
2763     20 2010-12-24    3766687.43             0        25.17       3.141  204.637673          7…
2811     20 2011-11-25    2906233.25             1        46.38       3.492  211.412076          7…
2814     20 2011-12-16    2762816.65             0        37.16       3.413  212.068504          7…
2815     20 2011-12-23    3555371.03             0        40.19       3.389  212.236040          7…
3192     23 2010-12-24    2734277.10             0        22.96       3.150  132.747742          5…
3764     27 2010-12-24    3078162.08             0        31.34       3.309  136.597273          8…
3816     27 2011-12-23    2739019.75             0        41.59       3.587  140.528765          7…

In [64]: sales.describe()

Out[64]:
             store  weekly_sales  holiday_flag  temperature   fuel_price          cpi   unem…
count  6435.000000  6.435000e+03   6435.000000  6435.000000  6435.000000  6435.000000    643…
mean     23.000000  1.046965e+06      0.069930    60.663782     3.358607   171.578394       …
std      12.988182  5.643666e+05      0.255049    18.444933     0.459020    39.356712       …
min       1.000000  2.099862e+05      0.000000    -2.060000     2.472000   126.064000       …
25%      12.000000  5.533501e+05      0.000000    47.460000     2.933000   131.735000       …
50%      23.000000  9.607460e+05      0.000000    62.670000     3.445000   182.616521       …
75%      34.000000  1.420159e+06      0.000000    74.940000     3.735000   212.743293       …
max      45.000000  3.818686e+06      1.000000   100.140000     4.468000   227.232807      1…


In [65]: sales.hist(figsize=(30,20));

In [66]: fig, ax = plt.subplots(figsize=(20, 5))
         sns.lineplot(x=sales.date, y=(sales.weekly_sales/1e6))
         plt.xlabel('months')
         plt.ylabel('Weekly Sales (in million USD)')
         plt.title('Weekly Sales Trend', fontdict={'fontsize': 16, 'color': 'red'}, pad=12)

         annot = ax.annotate("", xy=(0, 0), xytext=(20, 20), textcoords="offset points",
                             bbox=dict(boxstyle="round", fc="w"),
                             arrowprops=dict(arrowstyle="->"))
         annot.set_visible(False)

         plt.show()


In [67]: sales['employment'] = 100 - sales['unemployment']

         sales['year'] = sales['date'].dt.year
         sales['month'] = sales['date'].dt.month
         sales['day'] = sales['date'].dt.day
         sales.head(3)

Out[67]:
   store       date  weekly_sales  holiday_flag  temperature  fuel_price         cpi  unemployment  …
0      1 2010-05-02    1643690.90             0        42.31       2.572  211.096358         8.106  …
1      1 2010-12-02    1641957.44             1        38.51       2.548  211.242170         8.106  …
2      1 2010-02-19    1611968.17             0        39.93       2.514  211.289143         8.106  …

In [68]: pivot_table = sales.pivot_table(index='month', columns='year', values='weekly_sales')
         pivot_table

Out[68]:
year           2010          2011          2012
month
1      9.386639e+05  9.420697e+05  9.567817e+05
2      1.064372e+06  1.042273e+06  1.057997e+06
3      1.034590e+06  1.011263e+06  1.025510e+06
4      1.021177e+06  1.033220e+06  1.014127e+06
5      1.039303e+06  1.015565e+06  1.053948e+06
6      1.055082e+06  1.038471e+06  1.082920e+06
7      1.023702e+06  9.976049e+05  1.025480e+06
8      1.025212e+06  1.044895e+06  1.064514e+06
9      9.983559e+05  1.026810e+06  9.988663e+05
10     1.027201e+06  1.020663e+06  1.044885e+06
11     1.176097e+06  1.126535e+06  1.042797e+06
12     1.198413e+06  1.274311e+06  1.025078e+06
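The same month-by-year means can be reproduced with a groupby, which some readers find easier to audit (an equivalent sketch; pivot_table defaults to the mean aggregation):

         sales.groupby(['month', 'year'])['weekly_sales'].mean().unstack()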


In [69]: fig, ax = plt.subplots(figsize=(20, 6))

         sns.set_palette("bright")
         sns.lineplot(x=pivot_table.index, y=pivot_table[2010]/1e6, ax=ax, label='2010')
         sns.lineplot(x=pivot_table.index, y=pivot_table[2011]/1e6, ax=ax, label='2011')
         sns.lineplot(x=pivot_table.index, y=pivot_table[2012]/1e6, ax=ax, label='2012')
         plt.ylabel('Average weekly sales (in millions USD)')
         plt.title('Average Sales Trends for 2010, 2011 & 2012',
                   fontdict={'fontsize': 16, 'color': 'red', 'horizontalalignment': 'center'},
                   pad=12)

         plt.legend()
         plt.show()

In [70]: def plot_top_and_bottom_stores(df, col):
             df = df.groupby(col).mean().sort_values(by='weekly_sales', ascending=False)

             top_stores = df.head(5)
             bottom_stores = df.tail(5)

             sns.set_palette("bright")

             fig, ax = plt.subplots(figsize=(10, 6))
             sns.barplot(x=top_stores.index, y=top_stores['weekly_sales']/1e6, order=top_stores.index)
             plt.title('Top 5 Stores by Average Sales')
             plt.ylabel('Average weekly sales (millions USD)')
             plt.show()

             fig, ax = plt.subplots(figsize=(10, 6))
             sns.barplot(x=bottom_stores.index, y=bottom_stores['weekly_sales']/1e6, order=bottom_stores.index)
             plt.title('Bottom 5 Stores by Average Sales')
             plt.ylabel('Average weekly sales (millions USD)')
             plt.show()


In [71]: plot_top_and_bottom_stores(sales, 'store')

In [72]: non_holiday_sales = sales[sales['holiday_flag'] == 0]
         holiday_sales = sales[sales['holiday_flag'] == 1]


In [73]: fig, ax = plt.subplots(figsize=(10, 5))

         sns.boxplot(data=[holiday_sales['weekly_sales']/1e6, non_holiday_sales['weekly_sales']/1e6])
         plt.ylabel('Weekly sales in million USD')
         plt.xlabel('Week type')
         plt.title('Box plots of non-holiday and holiday weekly sales')
         plt.show()


In [74]: fig, ax = plt.subplots(figsize=(15, 15))

         heatmap = sns.heatmap(sales.corr(), vmin=-1, vmax=1, annot=True, cmap="YlGn")
         heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 14}, pad=12);
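One portability note: on pandas 2.x, DataFrame.corr() raises on non-numeric columns such as the datetime date column instead of silently dropping them. A version-safe variant (a sketch, not what this notebook ran) restricts the correlation to numeric columns:

         # numeric_only (pandas >= 1.5) excludes the datetime 'date' column explicitly.
         sns.heatmap(sales.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap="YlGn")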

In [75]: sales_copy = sales.copy()


In [76]: sales_copy.drop(['date', 'unemployment'], axis=1, inplace=True)
         sales_copy.head()

Out[76]:
   store  weekly_sales  holiday_flag  temperature  fuel_price         cpi  employment  year  …
0      1    1643690.90             0        42.31       2.572  211.096358      91.894  2010  …
1      1    1641957.44             1        38.51       2.548  211.242170      91.894  2010  …
2      1    1611968.17             0        39.93       2.514  211.289143      91.894  2010  …
3      1    1409727.59             0        46.63       2.561  211.319643      91.894  2010  …
4      1    1554806.68             0        46.50       2.625  211.350143      91.894  2010  …

In [77]: X = sales_copy.drop('weekly_sales', axis=1)
         y = sales_copy['weekly_sales']

In [78]: scaler = StandardScaler()
         X_scaled = scaler.fit_transform(X)

In [79]: X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)  # split fraction truncated in the export; 0.2 assumed
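Note that the scaler above was fit on the full matrix before splitting, so the test rows influence the scaling statistics. A common variant that keeps test data out of the scaler's statistics fits it on the training split only (a sketch under that assumption; the random_state is illustrative):

         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
         scaler = StandardScaler()
         X_train = scaler.fit_transform(X_train)   # statistics come from training rows only
         X_test = scaler.transform(X_test)         # reuse the training mean/std on the test set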

In [80]: def evaluate_model(model, X_train, y_train, X_test, y_test):
             model.fit(X_train, y_train)
             # predict on the held-out set
             y_pred = model.predict(X_test)
             # mean squared error, then its square root
             mse = mean_squared_error(y_test, y_pred)
             rmse = np.sqrt(mse)
             return rmse
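Depending on the installed scikit-learn, the MSE-then-sqrt pair collapses into one call (a convenience only, not a change to the method):

         # One-call RMSE on scikit-learn 0.22-1.5; newer releases offer root_mean_squared_error instead.
         rmse = mean_squared_error(y_test, y_pred, squared=False)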

In [81]: def evaluate_regressors_rmses(regressors, regressor_names, X_train, y_train, X_test, y_test):
             rmses = [evaluate_model(regressor, X_train, y_train, X_test, y_test) for regressor in regressors]
             regressor_rmses = dict(zip(regressor_names, rmses))

             df = pd.DataFrame.from_dict(regressor_rmses, orient='index')
             df = df.reset_index()
             df.columns = ['regressor_name', 'rmse']

             return df.sort_values('rmse', ignore_index=True)


In [82]: linear_regressor = LinearRegression()

         polynomial_features = PolynomialFeatures(degree=2)
         polynomial_regressor = Pipeline([("polynomial_features", polynomial_features),
                                          ("linear_regression", linear_regressor)])
         ridge_regressor = Ridge()
         lasso_regressor = Lasso()
         elastic_net_regressor = ElasticNet()
         decision_tree_regressor = DecisionTreeRegressor()
         random_forest_regressor = RandomForestRegressor()
         boosted_tree_regressor = GradientBoostingRegressor()
         neural_network_regressor = MLPRegressor()
         support_vector_regressor = SVR()
         grad_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)  # learning rate truncated in the export; 0.1 assumed
         knn_regressor = KNeighborsRegressor(n_neighbors=5, weights='uniform')
         spline_regressor = make_pipeline(PolynomialFeatures(3), LinearRegression())

In [83]: regressors = [linear_regressor, polynomial_regressor, ridge_regressor, lasso_regressor,
                       elastic_net_regressor, decision_tree_regressor, random_forest_regressor,
                       boosted_tree_regressor, neural_network_regressor, support_vector_regressor,
                       knn_regressor, spline_regressor]

         regressor_names = ["Linear Regression", "Polynomial Regression", "Ridge Regression",
                            "Lasso Regression", "Elastic Net Regression", "Decision Tree Regression",
                            "Random Forest Regression", "Boosted Tree Regression",
                            "Neural Network Regression", "Support Vector Regression",
                            "K-Nearest Neighbour Regression", "Spline Regression"]

In [84]: print('\033[1m Table of regressors and their RMSEs')
         evaluate_regressors_rmses(regressors, regressor_names, X_train, y_train, X_test, y_test)

Table of regressors and their RMSEs

Out[84]:
                    regressor_name          rmse
0         Random Forest Regression  1.143487e+05
1         Decision Tree Regression  1.455773e+05
2          Boosted Tree Regression  1.750067e+05
3                Spline Regression  4.376812e+05
4   K-Nearest Neighbour Regression  4.606521e+05
5            Polynomial Regression  4.786491e+05
6                 Ridge Regression  5.209623e+05
7                 Lasso Regression  5.209627e+05
8                Linear Regression  5.209628e+05
9           Elastic Net Regression  5.258987e+05
10      Support Vector Regression   5.687996e+05
11      Neural Network Regression   1.185606e+06

In [86]: rmse = evaluate_regressors_rmses(regressors, regressor_names, X_train, y_train, X_test, y_test)


In [87]: best_rmse = rmse.iloc[0]['rmse']

         # compute the median of the weekly sales
         median_sale = sales['weekly_sales'].median()
         # compute percentage error
         percent_deviation = round((best_rmse * 100 / median_sale), 2)
         # print the result
         print('The model has average percentage error of {}%'.format(percent_deviation))

The model has average percentage error of 12.01%
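For reference, with the values from the tables above: 114,348.7 / 960,746.0 × 100 ≈ 11.9%, consistent with the printed 12.01% (In [86] re-fits the regressors, so the tree-based RMSEs vary slightly between runs).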
