PRJ Sales Forecasting

This document walks through sales forecasting on a dataset of item sales from various outlets. It covers: 1) importing libraries and reading the dataset; 2) exploratory data analysis, including missing-value imputation; 3) feature engineering such as label encoding and one-hot encoding; 4) splitting the data into train and test sets; and 5) building and comparing several regression models, including linear regression, lasso, and random forest.


SALES FORECASTING

Dependent variable: Item_Outlet_Sales

Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import os
import warnings
warnings.filterwarnings('ignore')
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

Check the current working directory

display (os.getcwd())

Change the current working directory

os.chdir('C:\\Noble\\Training\\Acmegrade\\Data Science\\Projects\\PRJ Sales Forecasting\\')
display(os.getcwd())

Read and display the data set

dt = pd.read_csv('Train.csv')
display (dt.head())

Display the shape


print (dt.shape)

Display the column names


display (dt.columns)

Describe the numeric columns


display (dt.describe())

Display Info
display (dt.info())

Display the Unique Values for each column


display (dt.apply(lambda x: len(x.unique())))

Check for Null Values

display (dt.isnull().sum())
Store the Categorical columns in a list

cat_col = []
for x in dt.dtypes.index:
    if dt.dtypes[x] == 'object':
        cat_col.append(x)
display(cat_col)
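
An equivalent one-liner sketch using pandas' select_dtypes (same result, given the dt read above):

cat_col = dt.select_dtypes(include='object').columns.tolist()  # all object-dtype columns
display(cat_col)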

Remove the identifier columns from the list


cat_col.remove('Item_Identifier')
cat_col.remove('Outlet_Identifier')
display (cat_col)

Display the Unique Values in category columns – Count


for col in cat_col:
    print(col, len(dt[col].unique()))

Unique values in each category

for col in cat_col:
    print(col)
    print(dt[col].value_counts())
    print()
    print('*' * 50)
Display the missing values - missing values will be True

miss_bool = dt['Item_Weight'].isnull()
display(miss_bool)

Missing value count - column - Item_Weight

display (dt['Item_Weight'].isnull().sum())

Display all NULL Records


Item_Weight_null = dt[dt['Item_Weight'].isna()]
display (Item_Weight_null)

NULL Records by Item Identifier column

Item_Weight_null['Item_Identifier'].value_counts()

Find the mean for the column – Item Weight group by Item Identifier

item_weight_mean = dt.pivot_table(values='Item_Weight', index='Item_Identifier')
display(item_weight_mean)

Display Item Identifier column

display (dt['Item_Identifier'])
Fill the missing Item_Weight values with the mean computed per Item_Identifier

for i, item in enumerate(dt['Item_Identifier']):
    if miss_bool[i]:
        if item in item_weight_mean.index:
            # use the per-item mean where one exists
            dt.loc[i, 'Item_Weight'] = item_weight_mean.loc[item]['Item_Weight']
        else:
            # otherwise fall back to the overall mean
            dt.loc[i, 'Item_Weight'] = np.mean(dt['Item_Weight'])
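
A vectorized alternative sketch for the same imputation, using groupby/transform instead of a Python loop (assumes the same dt; the second line is the fallback for identifiers whose weights are all missing):

dt['Item_Weight'] = dt['Item_Weight'].fillna(
    dt.groupby('Item_Identifier')['Item_Weight'].transform('mean'))
dt['Item_Weight'] = dt['Item_Weight'].fillna(dt['Item_Weight'].mean())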

Check the Null values again – Same column


display (dt['Item_Weight'].isnull().sum())

Record count based on 'Outlet_Size'

dt.groupby('Outlet_Size').agg({'Outlet_Size': np.size})

NULL Record based on 'Outlet_Size'

display (dt['Outlet_Size'].isnull().sum())

Display all NULL Records


Outlet_Size_null= dt[dt['Outlet_Size'].isna()]
display (Outlet_Size_null)
Null record count based on Outlet_Type

Outlet_Size_null['Outlet_Type'].value_counts()

Group by Outlet_Type and Outlet_Size to find the most frequent value; this is used to fill the missing Outlet_Size values by Outlet_Type

dt.groupby(['Outlet_Type','Outlet_Size']).agg({'Outlet_Type':[np.size]})

Alternate way to identify the most repeated value – Mode

outlet_size_mode = dt.pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=(lambda x: x.mode()[0]))
display(outlet_size_mode)

Use Mode to fill missing values


miss_bool = dt['Outlet_Size'].isnull()
# outlet_size_mode[x] is a one-row column; .iloc[0] extracts the scalar mode
dt.loc[miss_bool, 'Outlet_Size'] = dt.loc[miss_bool, 'Outlet_Type'].apply(lambda x: outlet_size_mode[x].iloc[0])
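
An alternative sketch that fills the missing sizes in one step with the per-Outlet_Type mode. It assumes every Outlet_Type has at least one non-null Outlet_Size, which holds in this dataset:

dt['Outlet_Size'] = dt['Outlet_Size'].fillna(
    dt.groupby('Outlet_Type')['Outlet_Size'].transform(lambda s: s.mode()[0]))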

Check the Null values


display (dt['Outlet_Size'].isnull().sum())

Check the group by count to see if the count increased


dt.groupby (['Outlet_Type','Outlet_Size'] ).agg({'Outlet_Type':[np.size]})

Check the Item_Visibility column for zero values

display(sum(dt['Item_Visibility']==0))

Replace zeros with mean


# assign the result back; replace(..., inplace=True) on a .loc slice may not modify dt
dt['Item_Visibility'] = dt['Item_Visibility'].replace(0, dt['Item_Visibility'].mean())
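
A per-item alternative sketch: a zero visibility more likely means "not recorded" than a truly invisible product, so zeros could instead be replaced with the mean visibility of the same Item_Identifier, falling back to the overall mean. This would run in place of the global-mean replacement above:

vis = dt['Item_Visibility'].replace(0, np.nan)              # treat zeros as missing
item_mean = vis.groupby(dt['Item_Identifier']).transform('mean')
dt['Item_Visibility'] = vis.fillna(item_mean).fillna(vis.mean())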

Check any value with 0 again


sum(dt['Item_Visibility']==0)

Check distinct values - Item_Fat_Content


dt['Item_Fat_Content'].value_counts()

Consolidate similar Column Values


dt['Item_Fat_Content'] = dt['Item_Fat_Content'].replace({'LF':'Low Fat', 'reg':'Regular', 'low fat':'Low Fat'})
display(dt['Item_Fat_Content'].value_counts())

Creating New Attributes

Create a new attribute from the first two characters of the Item_Identifier column

dt['New_Item_Type'] = dt['Item_Identifier'].apply(lambda x: x[:2])
display(dt['New_Item_Type'])

Display the number of records in each category

display(dt['New_Item_Type'].value_counts())

Map the values

dt['New_Item_Type'] = dt['New_Item_Type'].map({'FD':'Food', 'NC':'Non-Consumable', 'DR':'Drinks'})
display(dt['New_Item_Type'].value_counts())

Display distinct values in Item_Fat_Content

display (dt['Item_Fat_Content'].value_counts())

Display the count based on New_Item_Type and Item_Fat_Content

dt.groupby(['New_Item_Type','Item_Fat_Content']).agg({'Outlet_Type':[np.size]})

Update Item_Fat_Content to 'Non-Edible' where New_Item_Type is Non-Consumable

dt.loc[dt['New_Item_Type']=='Non-Consumable', 'Item_Fat_Content'] = 'Non-Edible'
display(dt['Item_Fat_Content'].value_counts())
Display the count based on New_Item_Type and Item_Fat_Content

dt.groupby(['New_Item_Type','Item_Fat_Content']).agg({'Outlet_Type':[np.size]})

Display how many years the outlet has been open

Outlet_Years = 2022 (current year) - Outlet_Establishment_Year

dt['Outlet_Years'] = 2022 - dt['Outlet_Establishment_Year']
print(dt['Outlet_Years'])

Display Top 5 Records


display (dt.head())

Exploratory Data Analysis

Create Dist Plot – Item Weight


sns.distplot(dt['Item_Weight'])
plt.show()
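
A version note: distplot is deprecated since seaborn 0.11 and removed in recent releases. A sketch of the equivalent call on newer versions (the same applies to the other dist plots below):

sns.histplot(dt['Item_Weight'], kde=True)  # replacement for the deprecated distplot
plt.show()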

Create Dist Plot – Item Visibility


sns.distplot(dt['Item_Visibility'])
plt.show()
Create Dist Plot – Item MRP

sns.distplot(dt['Item_MRP'])
plt.show()

Create Dist Plot – Item Outlet Sales


sns.distplot(dt['Item_Outlet_Sales'])
plt.show()

Log Transformation to reduce the impact of Outliers

# The dist plot above is right skewed, so there may be outliers on the right side.
# A log transformation compresses the right tail and reduces their influence.
dt['Item_Outlet_Sales'] = np.log(1+dt['Item_Outlet_Sales'])
display(dt['Item_Outlet_Sales'])
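
Keep in mind that models trained on this column now predict log sales. A small sketch of the inverse transform for reporting predictions back in the original sales units (pred_log is a hypothetical array of log-scale predictions):

pred_log = np.array([5.2, 6.1])   # illustrative log-scale predictions, not real output
pred_sales = np.expm1(pred_log)   # np.expm1 inverts np.log(1 + x)
print(pred_sales)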

Create Dist Plot – again


sns.distplot(dt['Item_Outlet_Sales'])
plt.show()

Create Count Plot – Number of records in each category


sns.countplot(dt["Item_Fat_Content"])
plt.show()
Create Count Plot – Item Type

# l is the list of unique Item_Type values, used to label the x-axis
l = list(dt['Item_Type'].unique())
chart = sns.countplot(dt["Item_Type"])
chart.set_xticklabels(labels=l, rotation=90)
plt.show()

Create Count Plot – Establishment year


Number of stores started per year
sns.countplot(dt['Outlet_Establishment_Year'])
plt.show()

Count Plot Outlet Size


sns.countplot(dt['Outlet_Size'])
plt.show()

Count Plot Outlet Location Type

sns.countplot(dt['Outlet_Location_Type'])
plt.show()

Count Plot Outlet Type


sns.countplot(dt['Outlet_Type'])
plt.show()
Correlation Matrix

Print the correlation matrix

corr = dt.corr()  # on pandas >= 2.0, use dt.corr(numeric_only=True)
display(corr)

Plot the correlation heatmap

sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

Display Top 5 Records

dt.head()

Label Encoding

Label Encoding – Column Outlet Identifier

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dt['Outlet'] = le.fit_transform(dt['Outlet_Identifier'])
display(dt['Outlet'])

Label Encoding – Remaining columns with For loop

cat_col = ['Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'New_Item_Type']
for col in cat_col:
    dt[col] = le.fit_transform(dt[col])
display(dt.head())
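
A caveat on the loop above: reusing one LabelEncoder keeps only the last column's fitted mapping. A sketch that would replace the loop (not follow it), storing one fitted encoder per column so codes can be decoded later:

encoders = {}
for col in cat_col:
    encoders[col] = LabelEncoder()          # one encoder per column
    dt[col] = encoders[col].fit_transform(dt[col])
# e.g. encoders['Outlet_Type'].inverse_transform([0, 1]) recovers the original labels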

One-hot Encoding

# these columns now hold integer label codes, so get_dummies creates one
# indicator column per code
dt = pd.get_dummies(dt, columns=['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'New_Item_Type'])
display(dt.head())

Create X – remove unused columns

X = dt.drop(columns=['Outlet_Establishment_Year', 'Item_Identifier', 'Outlet_Identifier', 'Item_Outlet_Sales'])
X.head()

Create y
y = dt['Item_Outlet_Sales']
y.head()

Train Test Split


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Function to train and evaluate a model

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

def train(model, X, y):
    # train the model and predict on the training data
    model.fit(X, y)
    pred = model.predict(X)

    # perform 5-fold cross-validation; scores are negative MSE, so report the absolute mean
    cv_score = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
    cv_score = np.abs(np.mean(cv_score))

    print("Model Report")
    print("CV Score:", cv_score)
    print("R2_Score:", r2_score(y, pred))

Create Linear Regression Model

from sklearn.linear_model import LinearRegression, Ridge, Lasso

# note: 'normalize' was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions drop it and scale the features beforehand instead
model = LinearRegression(normalize=True)
train(model, X_train, y_train)
coef = pd.Series(model.coef_, X.columns).sort_values()
print(coef)
coef.plot(kind='bar', title="Model Coefficients")
plt.show()

Create Ridge Regression

model = Ridge(normalize=True)  # same 'normalize' version caveat as above
train(model, X_train, y_train)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title="Model Coefficients")
plt.show()

Create Lasso Regression


model = Lasso()
train(model, X_train, y_train)
coef = pd.Series(model.coef_, X.columns).sort_values()
coef.plot(kind='bar', title="Model Coefficients")
plt.show()

Decision Tree Regression

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
train(model, X_train, y_train)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title="Feature Importance")
plt.show()

Random Forest Regression

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
train(model, X_train, y_train)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title="Feature Importance")
plt.show()

Extra Tree Regression

from sklearn.ensemble import ExtraTreesRegressor

model = ExtraTreesRegressor()
train(model, X_train, y_train)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title="Feature Importance")
plt.show()

LGBMRegressor

from lightgbm import LGBMRegressor

model = LGBMRegressor()
train(model, X_train, y_train)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title="Feature Importance")
plt.show()

XG Boost Regressor

from xgboost import XGBRegressor

model = XGBRegressor()
train(model, X_train, y_train)
coef = pd.Series(model.feature_importances_, X.columns).sort_values(ascending=False)
coef.plot(kind='bar', title="Feature Importance")
plt.show()

Random Search CV

from sklearn.model_selection import RandomizedSearchCV


Parameters

# note: 'auto' is not an accepted max_features value for RandomForestRegressor
# in recent scikit-learn releases; use 1.0 or 'sqrt' there instead
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(5, 30, num=6)]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10]

Param Grid

random_grid = {
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf}

print(random_grid)

Random Forest Regression

rf = RandomForestRegressor()
rf = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, scoring='neg_mean_squared_error', n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=1)
display(rf.fit(X_train, y_train))

Best Parameters

print(rf.best_params_)
print(rf.best_score_)
predictions=rf.predict(X_test)
display (r2_score (y_test,predictions))
display (predictions)
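
Since rf is now the fitted RandomizedSearchCV, rf.predict already uses the refit best_estimator_. A short sketch reporting the test RMSE (on the log-sales scale) alongside the R2 above, reusing the mean_squared_error import from the train function cell:

print(np.sqrt(mean_squared_error(y_test, predictions)))  # test RMSE in log-sales units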

Create the Dist plot


sns.distplot(y_test-predictions)
plt.show()

Parameters for the LGBM Regressor

from scipy.stats import uniform, randint

# notes: scipy's uniform(loc, scale) samples from [loc, loc + scale], so
# uniform(0.6, 0.4) draws subsample values in [0.6, 1.0]; 'gamma' is an
# XGBoost parameter that LightGBM ignores (its analogue is min_split_gain)
params = {
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3),  # default 0.1
    "max_depth": randint(2, 6),           # default 3
    "n_estimators": randint(100, 150),    # default 100
    "subsample": uniform(0.6, 0.4)
}

Model LGBM Regressor

lgb = LGBMRegressor()
lgb = RandomizedSearchCV(estimator=lgb, param_distributions=params, scoring='neg_mean_squared_error', n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=1)
# note: fitting on all of X means the X_test evaluation below is optimistic,
# because the test rows were seen during the search
lgb.fit(X, y)

Best Parameter

print(lgb.best_params_)
print(lgb.best_score_)
predictions=lgb.predict(X_test)
display (r2_score (y_test,predictions))
display (predictions)

Create Dist Plot

sns.distplot(y_test-predictions)
plt.show()
Model XG Boost

params = {
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3),  # default 0.1
    "max_depth": randint(2, 6),           # default 3
    "n_estimators": randint(100, 150),    # default 100
    "subsample": uniform(0.6, 0.4)
}

XG Boost Regressor

xgb = RandomizedSearchCV(estimator=XGBRegressor(), param_distributions=params, scoring='neg_mean_squared_error', n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=1)
xgb.fit(X, y)  # same caveat as above: the search sees the X_test rows

Print Best Parameter

print(xgb.best_params_)
print(xgb.best_score_)
predictions=xgb.predict(X_test)
display (r2_score (y_test,predictions))
display (predictions)
Create Dist plot

sns.distplot(y_test-predictions)
plt.show()
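
A closing sketch to compare the three tuned searches on the held-out split. It assumes rf, lgb and xgb are the fitted RandomizedSearchCV objects from above; note again that lgb and xgb were fit on all of X, so their test scores are optimistic:

for name, search in [('RandomForest', rf), ('LightGBM', lgb), ('XGBoost', xgb)]:
    preds = search.predict(X_test)
    print(name, 'R2:', r2_score(y_test, preds),
          'RMSE:', np.sqrt(mean_squared_error(y_test, preds)))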
