IDS Case Study
Members
HU22CSEN0102140-YASHWANTH REDDY YELMETI
HU22CSEN0101780-KHALID
HU22CSEN0102199-AKHIL REDDY
PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base and on demand
prediction of sales trends. Sales forecasting is the process of estimating
future sales. Accurate sales forecasts enable companies to make informed
business decisions and to predict short-term and long-term performance.
Companies can base their forecasts on past sales data, industry-wide
comparisons, and economic trends. Sales forecasts also help sales teams
achieve their goals by identifying early warning signals in the sales
pipeline and course-correcting before it is too late. The goal of this case
study is to improve prediction accuracy over the existing project, so that
companies can increase their sales and profit, and to choose an efficient
algorithm by comparing several algorithms.
DATA COLLECTION:
The dataset was collected from https://www.kaggle.com/. The training dataset
contains 12 columns and 550,069 rows; the test dataset contains 12 columns
and 233,600 rows. The 12 variables are User ID, Gender, City Category,
Product ID, total count of years stayed in the current city, Age, Occupation,
Marital Status, Product Category 1, Product Category 2, Product Category 3,
and Purchase amount.
DATA PREPROCESSING:
This is an important step in the data mining process because it improves the
quality of the raw experimental data.

i) Removal of null values:
In this step, the null values in the fields Product Category 2 and Product
Category 3 are filled with the mean value of the feature.

ii) Converting categorical values into numerical values:
Machine learning algorithms handle numerical values more easily because they
are in machine-readable form. Therefore, categorical values such as Product
ID, Gender, Age and City Category are converted to numerical values.
Step 1: The categorical columns were selected based on their data type.
Step 2: Using Python, the categorical values were converted into numerical
values.
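Steps i) and ii) can be sketched with pandas as follows. The mini-frame below is hypothetical and only mirrors a few of the dataset's columns; the real project works on the full Kaggle data.

```python
import pandas as pd

# Hypothetical mini-frame with a few of the dataset's columns (illustrative values).
df = pd.DataFrame({
    "Gender": ["F", "M", "M"],
    "Age": ["0-17", "26-35", "26-35"],
    "City_Category": ["A", "B", "C"],
    "Product_Category_2": [2.0, None, 8.0],
})

# i) Fill null values with the mean of the feature.
df["Product_Category_2"] = df["Product_Category_2"].fillna(
    df["Product_Category_2"].mean()
)

# ii) Convert categorical columns to integer codes.
for col in ["Gender", "Age", "City_Category"]:
    df[col] = pd.factorize(df[col])[0]

print(df)
```

`pd.factorize` is one simple encoding choice; label or one-hot encoding would work equally well here.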
iii) Separating the target variable:
Here, we separate the target feature that we are going to predict. In this
case, Purchase is the target variable.
Step 1: The target label Purchase is assigned to the variable ‘y’.
Step 2: The preprocessed data, excluding the target label Purchase, is
assigned to the variable ‘X’.
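The two steps above amount to a single drop/select in pandas; the frame below is an assumed toy stand-in for the preprocessed data.

```python
import pandas as pd

# Hypothetical preprocessed frame (column names taken from the dataset description).
df = pd.DataFrame({
    "Occupation": [10, 16, 15],
    "Marital_Status": [0, 0, 1],
    "Purchase": [8370, 15200, 1422],
})

# Step 1: the target label Purchase goes into y.
y = df["Purchase"]
# Step 2: everything except the target goes into X.
X = df.drop(columns=["Purchase"])
```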
iv) Standardizing the features:
Here, we standardize the features because standardization arranges the data
in a standard normal distribution. The scaler is fitted only on the training
data, because any transformation of the features should be fitted on the
training data alone and then applied to the validation and test data.
Step 1: Only the training data was taken for fitting.
Step 2: Using the StandardScaler API, we standardized the features.
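A minimal sketch of the fit-on-train-only pattern with scikit-learn's StandardScaler; the arrays are illustrative, not the real split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/validation split (illustrative numbers only).
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_valid = np.array([[2.0, 300.0]])

scaler = StandardScaler()
# Fit the scaler on the training data only...
X_train_std = scaler.fit_transform(X_train)
# ...then apply the same fitted transformation to the validation data.
X_valid_std = scaler.transform(X_valid)
```

Fitting on the training data alone prevents information from the validation set leaking into the model.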
ALGORITHMS
Linear Regression:
Linear Regression is one of the most common machine learning and data
analysis techniques. This algorithm is useful for forecasting based on a
linear regression equation. Linear regression combines a set of independent
features (x) to predict the output value or dependent variable (y). The
linear equation assigns a factor to each independent variable, called a
coefficient and represented by β.
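A small sketch of how the β coefficients are recovered: the synthetic data below follows y = 2·x1 + 3·x2 + 5 exactly (an assumption for illustration), so the fitted coefficients come out as 2 and 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2*x1 + 3*x2 + 5 exactly (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

model = LinearRegression()
model.fit(X, y)
# coef_ holds the β factors for each feature; intercept_ holds the constant term.
print(model.coef_, model.intercept_)
```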
XGBoost:
XGBoost, also known as Extreme Gradient Boosting, has been used in order to
obtain an efficient model with high computational speed and efficacy. The
algorithm makes predictions using an ensemble method that models the residual
errors of a sequence of decision trees to refine the final predictions. The
trained model also reports the contribution of each feature to the final
prediction.
Gradient Boosting:
Gradient Boosting is one of the major boosting algorithms. Boosting is an
ensemble technique in which successive predictors learn from the mistakes of
their predecessors. It is a method of improving weak learners and combining
them into a single prediction model. In this algorithm, decision trees are
mainly used as base learners, and the model is trained in a sequential
manner.
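A minimal sketch of the technique with scikit-learn: shallow decision trees serve as the weak base learners and are added sequentially, each one fitted to the residuals of the ensemble so far. The data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = 4 * X[:, 0] + X[:, 1]

# Shallow trees as weak learners, trained sequentially on residuals.
model = GradientBoostingRegressor(n_estimators=100, max_depth=2, learning_rate=0.1)
model.fit(X, y)
print(model.score(X, y))
```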
Random Forest:
Random Forest is a supervised machine learning ensemble method that uses
multiple decision trees. It involves a technique called bootstrap
aggregation, also known as bagging, which aims to reduce the variance of
models that overfit the training data. Rather than depending on an
individual decision tree, the algorithm combines multiple decision trees to
produce the final outcome.
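The bagging idea can be sketched as follows: each tree is trained on a bootstrap sample of the data, and the forest averages the trees' predictions. The data below is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(2)
X = rng.random((200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1]

# bootstrap=True: each of the 50 trees sees a resampled copy of the data;
# the final prediction averages over all trees.
model = RandomForestRegressor(n_estimators=50, bootstrap=True, random_state=0)
model.fit(X, y)
print(model.score(X, y))
```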
Feature Selection:
The Product_Category_1 feature has by far the highest regression coefficient
and is a very important feature.
RESULTS AND DISCUSSION:
The evaluation of machine learning algorithms is an essential part of
building any prediction model. For that, we should carefully choose the
evaluation metrics, which are used to measure or judge the quality of the
model. The performance comparison of the machine learning algorithms here
focuses mainly on accuracy, since companies use models with high accuracy
for practical business decisions.
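The two metrics used in the comparison can be computed as follows. RMSE is the square root of the mean squared error, and the "accuracy" reported by scikit-learn's score() is the R² coefficient of determination; the purchase amounts below are hypothetical.

```python
import numpy as np

# Hypothetical purchase amounts vs. model predictions (illustrative only).
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 195.0])

# RMSE: square root of the mean squared error.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2, the "accuracy" that scikit-learn's score() reports for regressors.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```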
Table: RMSE and accuracy for each algorithm.
Based on the performance, we conclude that the XGBoost and Gradient Boosting
algorithms are the best fit compared to the other algorithms. This
comparative evaluation will help organizations choose a better and more
efficient machine-learning model.
Figure: Accuracy for different Machine Learning Techniques
Figure: Accuracy and RMSE for different Machine Learning Techniques
SYSTEM REQUIREMENTS
HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
CONCLUSION
CODE
from sklearn.ensemble import GradientBoostingRegressor, ExtraTreesRegressor
from sklearn import linear_model
from xgboost import XGBRegressor
import numpy as np
import matplotlib.pyplot as plt

# Gradient Boosting
regressorgbr = GradientBoostingRegressor()
regressorgbr.fit(X_train, y_train)
accuracy2 = regressorgbr.score(X_valid, y_valid) * 100

# XGBoost
regressorxg = XGBRegressor()
regressorxg.fit(X_train, y_train)
xgrpredict = regressorxg.predict(X_valid)
accuracy3 = regressorxg.score(X_valid, y_valid) * 100

# Linear Regression
regressorlr = linear_model.LinearRegression()
regressorlr.fit(X_train, y_train)
pred = regressorlr.predict(X_valid)
accuracy4 = regressorlr.score(X_valid, y_valid) * 100

# Extra Trees
Exregressor = ExtraTreesRegressor()
Exregressor.fit(X_train, y_train)
mpredict = Exregressor.predict(X_valid)
accuracy5 = Exregressor.score(X_valid, y_valid) * 100

# Final predictions on the test set using the Gradient Boosting model
finalpredict = regressorgbr.predict(test_df)

# Pie chart of the gender distribution
size = train_df['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]
plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size, colors=colors, labels=labels, shadow=True,
        explode=explode, autopct='%.2f%%')
plt.show()

print("Accuracy for Linear Regression: ", accuracy4, '%')
print("Accuracy for ExtraTreesRegressor: ", accuracy5, '%')

# Bar chart comparing the accuracy of all models
New = [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5]
barWidth = 0.25
fig = plt.subplots(figsize=(12, 8))
br1 = np.arange(len(New))
br2 = [x + barWidth for x in br1]
plt.xlabel("Algorithm")
plt.ylabel("Percentage %")
plt.title("Accuracy Chart")
plt.show()