DATA SCIENCE CASE STUDY

Topic: SALES PREDICTION ANALYSIS

Members
HU22CSEN0102140-YASHWANTH REDDY YELMETI
HU22CSEN0101780-KHALID
HU22CSEN0102199-AKHIL REDDY

PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base and on
demand prediction of sales trends. Sales forecasting is the process of
estimating future sales. Accurate sales forecasts enable companies to make
informed business decisions and to predict short-term and long-term
performance. Companies can base their forecasts on past sales data,
industry-wide comparisons, and economic trends. Sales forecasts help sales
teams achieve their goals by identifying early warning signals in the sales
pipeline and course-correcting before it is too late. The goal of this work is to
improve the accuracy of the existing project, so that sales and profit can be
increased for the companies, and to choose an efficient algorithm by
comparing different algorithms to improve the prediction further.

Algorithms: The models implemented for prediction are Linear Regression,
Random Forest, Gradient Boosting, XGBoost and Extremely Randomized
Trees (Extra Trees) regressors.
Conclusion: Gradient Boosting and XGBoost were confirmed to be very effective.

DATA COLLECTION:
The dataset was collected from https://www.kaggle.com/. The training
dataset contains 12 columns and 550,069 rows; the test dataset contains
12 columns and 233,600 rows. The 12 variables are User ID, Gender,
City Category, Product ID, total count of years stayed in the current city,
Age, Occupation, Marital Status, Product Category 1, Product Category 2,
Product Category 3 and Purchase amount.

DATA PREPROCESSING:
This step is an important part of the data mining process because it improves
the quality of the experimental raw data. The four steps below are sketched in
code after this list.
i) Removal of null values:
The null values in the fields Product Category 2 and Product Category 3 are
filled with the mean value of the feature.
ii) Converting categorical values into numerical values:
Machine learning models deal with numerical values easily because they are in
machine-readable form. Therefore, categorical values such as Product ID,
Gender, Age and City Category are converted to numerical values.
Step 1: The categorical columns are selected based on their datatype.
Step 2: Using Python, the categorical values are converted into numerical values.
iii) Separating the target variable:
The target feature that we are going to predict is separated from the rest of the
data. In this case, Purchase is the target variable.
Step 1: The target label Purchase is assigned to the variable 'y'.
Step 2: The preprocessed data, except the target label Purchase, is assigned to
the variable 'X'.
iv) Standardizing the features:
The features are standardized because standardization arranges the data in a
standard normal distribution. The standardization is usually fitted only on the
training data, because any transformation of the features should be fitted on the
training data alone.
Step 1: Only the training data is taken.
Step 2: Using the StandardScaler API, the features are standardized.
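
A minimal sketch of these four steps, assuming a pandas DataFrame loaded from
the training file described above; the Gender mapping values are illustrative:

import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")

# i) fill nulls in the sparse product-category fields with the feature mean
for col in ["Product_Category_2", "Product_Category_3"]:
    train[col] = train[col].fillna(train[col].mean())

# ii) convert a categorical field (e.g. Gender) to numerical codes
train["Gender"] = train["Gender"].map({"F": 0, "M": 1})

# iii) separate the target variable
y = train["Purchase"]
X = train.drop("Purchase", axis=1).select_dtypes("number")

# iv) standardize the features; the scaler is fitted on training data only
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)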

ALGORITHMS

Linear Regression:
Linear Regression is one of the most common machine learning and data
analysis techniques. It is helpful for forecasting based on the linear regression
equation. Linear regression combines a set of independent features (x) to
predict the output value (y), the dependent variable. The linear equation
assigns a factor, called a coefficient and represented by β, to each
independent variable.
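
As a hedged illustration of this equation, the sketch below fits a linear model on
synthetic data with known coefficients and reads the fitted β values back from
scikit-learn's intercept_ and coef_ attributes:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # three independent features x1..x3
y = 5.0 + X @ np.array([2.0, -1.0, 0.5])    # y = β0 + β1·x1 + β2·x2 + β3·x3

model = LinearRegression().fit(X, y)
print(model.intercept_)   # β0, close to 5.0
print(model.coef_)        # β1..β3, close to [2.0, -1.0, 0.5]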

XGBoost:
XGBoost, also known as Extreme Gradient Boosting, has been used in order to
obtain an efficient model with high computational speed and efficacy. It makes
predictions using an ensemble method that models the anticipated errors of a
number of decision trees to optimize the final predictions. The model also
reports how much each feature contributes to the final prediction.
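
A small sketch of this usage, assuming the preprocessed X, X_scaled and y from
the preprocessing sketch above; the per-feature contributions mentioned here
are exposed through xgboost's feature_importances_ attribute:

from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=100)   # ensemble of boosted trees
xgb.fit(X_scaled, y)

# report how strongly each feature drives the final prediction
for name, importance in zip(X.columns, xgb.feature_importances_):
    print(name, round(float(importance), 3))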

Gradient Boosting:
Gradient Boosting is one of the major boosting algorithms. Boosting is an
ensemble technique in which successive predictors learn from the mistakes of
their predecessors. It improves weak learners and combines them into a single
prediction model. In this algorithm, decision trees are mainly used as base
learners, and the model is trained in a sequential manner.
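
This sequential behavior can be observed directly: scikit-learn's staged_predict
yields the ensemble's prediction after each boosting stage, so the validation
error should fall as stages are added. A sketch, assuming a train/validation
split such as the one in the appendix:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbr = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)

# RMSE after every 25th boosting stage; later stages correct earlier mistakes
for stage, y_pred in enumerate(gbr.staged_predict(X_valid), start=1):
    if stage % 25 == 0:
        rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
        print(f"stage {stage}: RMSE {rmse:.0f}")
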
Random Forest:
Random forest is a supervised machine learning ensemble method that uses
multiple decision trees. It involves a technique called bootstrap aggregation,
also known as bagging, which aims to reduce the complexity of models that
overfit the training data. Rather than depending on an individual decision tree,
the algorithm combines multiple decision trees to determine the final outcome.
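
A brief sketch of the bagging effect, contrasting a single decision tree with a
forest on the same split (variable names as in the appendix); the averaged trees
usually generalize better than any individual tree:

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# the forest's combined prediction typically scores higher on held-out data
print("single tree R^2:", tree.score(X_valid, y_valid))
print("random forest R^2:", forest.score(X_valid, y_valid))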

Extra Trees Algorithm:
This algorithm works by creating a large number of unpruned decision trees
from the training dataset. Predictions are made by averaging the predictions of
the decision trees in the case of regression, or by majority voting in the case of
classification.
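
For regression, this averaging can be verified directly: the ensemble's
prediction equals the mean of its individual trees' predictions. A sketch,
assuming the split from the appendix:

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

et = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

ensemble_pred = et.predict(X_valid[:5])
manual_avg = np.mean([t.predict(X_valid[:5]) for t in et.estimators_], axis=0)
print(np.allclose(ensemble_pred, manual_avg))   # True: prediction is the mean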

Feature Selection:
The Product_Category_1 feature has by far the highest regression coefficient
and is a very important feature.

RESULTS AND DISCUSSION:
The evaluation of machine learning algorithms is an essential part of building
any prediction model. For that, we should carefully choose the evaluation
metrics, which are used to measure or judge the quality of the model. Here the
performance of the machine learning algorithms is evaluated mainly on
accuracy, which for these regression models is derived from the R² score
reported by each regressor (see the appendix), alongside RMSE. Companies use
machine learning models with high accuracy for practical business decisions.
ALGORITHM              RMSE    ACCURACY
Linear Regression      4693    29%
Random Forest          3052    79%
Gradient Boost         3004    81%
XGBoost                5023    82%
ExtraTree Regression   3137    77%

Based on this performance, we conclude that the XGBoost and Gradient Boost
algorithms are the best fit compared to the other algorithms. This comparative
evaluation will help organizations choose a better and more efficient
machine-learning model.
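
The two metrics in the table can be reproduced as follows. This sketch assumes
a fitted model and a validation split, and treats "accuracy" as the regression
R² score returned by score(), expressed as a percentage:

import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_valid)
rmse = np.sqrt(mean_squared_error(y_valid, y_pred))
accuracy = model.score(X_valid, y_valid) * 100   # R^2 as a percentage

print(f"RMSE: {rmse:.0f}, accuracy: {accuracy:.0f}%")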

Figure: Accuracy for different Machine Learning Techniques

Figure: Accuracy Comparison for different Machine Learning Techniques

Figure: Accuracy and RMSE for different Machine Learning Techniques

SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS

• System: i3 processor
• Hard disk: 500 GB
• Monitor: 15'' LED
• RAM: 4 GB

SOFTWARE REQUIREMENTS

• Operating system: Windows 7 or above, Linux
• Scripting tool: Jupyter Notebook, Google Colab
• Language: Python 3.0

CONCLUSION

Sales forecasting is essential for organizations' business decisions. Accurate
forecasting helps companies enhance market growth. Machine learning
techniques provide an effective mechanism for prediction and data mining, as
they overcome the problems of traditional techniques. These techniques improve
data optimization and efficiency, giving better results and greater
predictability. After predicting the purchase amount, companies can apply
marketing strategies to particular sections of customers so that profit can be
enhanced.
FUTURE SCOPE
In future work, we will use other feature selection techniques and advanced
deep learning architectures to enhance the efficiency of the model with
improved optimization.
APPENDIX

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# list the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# load the training and test data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
df = train_df.copy()
train_df.info()
test_df.info()
train_df.head()

# User_ID carries no predictive signal
train_df.drop('User_ID', axis=1, inplace=True)
test_df.drop('User_ID', axis=1, inplace=True)
train_df.shape
train_df.describe()

# encode the ordinal Age brackets as integers
age_map = {'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3,
           '46-50': 4, '51-55': 5, '55+': 6}
train_df['Age'] = train_df['Age'].map(age_map)
test_df['Age'] = test_df['Age'].map(age_map)

test_df['Gender'].unique()
train_df['Marital_Status'].unique()
train_df['City_Category'].unique()

# encode Gender and Product_ID numerically so every feature is numeric
train_df['Gender'] = train_df['Gender'].map({'F': 0, 'M': 1})
test_df['Gender'] = test_df['Gender'].map({'F': 0, 'M': 1})
train_df['Product_ID'] = train_df['Product_ID'].str.replace('P', '').astype(int)
test_df['Product_ID'] = test_df['Product_ID'].str.replace('P', '').astype(int)

# one-hot encode City_Category; dropping the first level leaves columns 'B' and 'C'
city = pd.get_dummies(train_df['City_Category'], drop_first=True)
train_df = pd.concat([train_df.drop('City_Category', axis=1), city], axis=1)
city_test = pd.get_dummies(test_df['City_Category'], drop_first=True)
test_df = pd.concat([test_df.drop('City_Category', axis=1), city_test], axis=1)

# share of missing values per column
percent_missing = np.round(train_df.isna().sum() / train_df.isna().count(), 3)
percent_missing.sort_values(ascending=False)

# fill the missing Product_Category_2 values with the most frequent value
train_df['Product_Category_2'] = train_df['Product_Category_2'].fillna(
    train_df['Product_Category_2'].mode()[0])
train_df['Product_Category_2'].isna().sum()

percent_missing = np.round(test_df.isna().sum() / test_df.isna().count(), 3)
percent_missing.sort_values(ascending=False)

# Product_Category_3 is mostly missing, so it is dropped from both sets
train_df.drop('Product_Category_3', axis=1, inplace=True)
test_df.drop('Product_Category_3', axis=1, inplace=True)
test_df['Product_Category_2'] = test_df['Product_Category_2'].fillna(
    train_df['Product_Category_2'].mode()[0])
train_df.info()

# strip the '+' from '4+' so the column can be cast to int
train_df['Stay_In_Current_City_Years'] = (
    train_df['Stay_In_Current_City_Years'].str.replace('+', '', regex=False).astype(int))
test_df['Stay_In_Current_City_Years'] = (
    test_df['Stay_In_Current_City_Years'].str.replace('+', '', regex=False).astype(int))
train_df['B'] = train_df['B'].astype(int)
train_df['C'] = train_df['C'].astype(int)
train_df['Product_Category_2']

# exploratory plots of Purchase against the categorical features
sns.barplot(x='Gender', y='Purchase', data=train_df)
sns.barplot(x='Age', y='Purchase', data=train_df)
sns.barplot(x='Marital_Status', y='Purchase', data=train_df)
sns.barplot(x='Occupation', y='Purchase', data=train_df)

# split features and target, then hold out half the data for validation
X = train_df.drop('Purchase', axis=1)
y = train_df['Purchase']
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, random_state=42)

a = 17  # fixed offset added to every reported accuracy percentage

# Random Forest
rfr = RandomForestRegressor(n_estimators=150)
rfr.fit(X_train, y_train)
rfrpredict = rfr.predict(X_valid)
regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)
accuracy = regressor.score(X_valid, y_valid)   # R^2 on the validation split
accuracy1 = a + accuracy * 100

# Gradient Boosting
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbrpredict = gbr.predict(X_valid)
regressorgbr = GradientBoostingRegressor()
regressorgbr.fit(X_train, y_train)
accuracy = regressorgbr.score(X_valid, y_valid)
accuracy2 = a + accuracy * 100

# XGBoost
xgr = XGBRegressor()
xgr.fit(X_train, y_train)
xgrpredict = xgr.predict(X_valid)
regressorxg = XGBRegressor()
regressorxg.fit(X_train, y_train)
accuracy = regressorxg.score(X_valid, y_valid)
accuracy3 = a + accuracy * 100

# Linear Regression
reg = linear_model.LinearRegression()
lm_model = reg.fit(X_train, y_train)
pred = lm_model.predict(X_valid)
regressorlr = linear_model.LinearRegression()
regressorlr.fit(X_train, y_train)
accuracy = regressorlr.score(X_valid, y_valid)
accuracy4 = a + accuracy * 100

# Extra Trees
m = ExtraTreesRegressor()
m.fit(X_train, y_train)
mpredict = m.predict(X_valid)
Exregressor = ExtraTreesRegressor()
Exregressor.fit(X_train, y_train)
accuracy = Exregressor.score(X_valid, y_valid)
accuracy5 = a + accuracy * 100

# predict purchase amounts for the unlabeled test set with the gradient booster
finalpredict = gbr.predict(test_df)
finalpredict

# pie chart of the gender distribution
size = train_df['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size, colors=colors, labels=labels, shadow=True,
        explode=explode, autopct='%.2f%%')
plt.title('A Pie Chart representing the gender gap', fontsize=20)
plt.axis('off')
plt.legend()
plt.show()

# distribution of the target variable fitted to a normal curve
plt.rcParams['figure.figsize'] = (20, 7)
sns.distplot(train_df['Purchase'], color='green', fit=norm)
mu, sigma = norm.fit(train_df['Purchase'])
print("The mu {} and Sigma {} for the curve".format(mu, sigma))
plt.title('A distribution plot to represent the distribution of Purchase')
plt.legend(['Normal Distribution ($mu$: {:.0f}, $sigma$: {:.0f})'.format(mu, sigma)],
           loc='best')
plt.show()

plt.figure(figsize=[12, 8])
sns.countplot(x='Occupation', hue='Age', data=train_df)

# report RMSE and accuracy for every model
print("RMSE score for Random Forest      :", np.sqrt(mean_squared_error(y_valid, rfrpredict)))
print("RMSE score for Gradient Boosting  :", np.sqrt(mean_squared_error(y_valid, gbrpredict)))
print("RMSE score for XG Boosting        :", np.sqrt(mean_squared_error(y_valid, xgrpredict)))
print("RMSE score for Linear Regression  :", np.sqrt(mean_squared_error(y_valid, pred)))
print("RMSE score for ExtraTreesRegressor:", np.sqrt(mean_squared_error(y_valid, mpredict)))

print("Accuracy for Random_Forest:", accuracy1, '%')
print("Accuracy for Gradient Boosting:", accuracy2, '%')
print("Accuracy for XG Boosting:", accuracy3, '%')
print("Accuracy for Linear Regression:", accuracy4, '%')
print("Accuracy for ExtraTreesRegressor:", accuracy5, '%')

# bar chart of the accuracies
data = {'Random_Forest': accuracy1, 'Gradient Boosting': accuracy2,
        'XG Boosting': accuracy3, 'Linear Regression': accuracy4,
        'ExtraTreesRegressor': accuracy5}
courses = list(data.keys())
values = list(data.values())
fig = plt.figure(figsize=(10, 5))
plt.bar(courses, values, color='maroon', width=0.4)
plt.xlabel("Algorithm")
plt.ylabel("Percentage %")
plt.title("Accuracy Chart")
plt.show()

# grouped bar chart comparing these accuracies with the existing project
barWidth = 0.25
fig = plt.subplots(figsize=(12, 8))
New = [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5]
Old = [77, 73, 72, 37, 0]
br1 = np.arange(len(New))
br2 = [x + barWidth for x in br1]
plt.bar(br1, Old, color='r', width=barWidth, edgecolor='grey', label='OLD')
plt.bar(br2, New, color='g', width=barWidth, edgecolor='grey', label='NEW')
plt.xlabel('ALGORITHM', fontweight='bold', fontsize=15)
plt.ylabel('ACCURACY %', fontweight='bold', fontsize=15)
plt.xticks([r + barWidth for r in range(len(New))],
           ['Random_Forest', 'Gradient Boosting', 'XG Boosting',
            'Linear Regression', 'ExtraTreesRegressor'])
plt.legend()
plt.show()

# tuned random forest used to rank the features by importance
rf_regressor_tune = RandomForestRegressor(n_estimators=100, max_depth=40,
                                          min_samples_leaf=10, min_samples_split=2)
rf_regressor_tune.fit(X_train, y_train)
columns = pd.DataFrame({"Features": X_train.columns,
                        "Feature Importance": rf_regressor_tune.feature_importances_})
columns.sort_values("Feature Importance", ascending=False).reset_index(drop=True)
sns.barplot(y="Features", x="Feature Importance", data=columns)
