Main Project
BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering
M.Bhavya Lakshmi Siridevi (19A81A05M5)
K.Chaitanya (19A81A05L2)
K.V.N.S.S.Krishna Sreekar (19A81A05K7)
Pedatadepalli, Tadepalligudem
This is to certify that the Project Report entitled “Machine Learning Application for Black Friday
Sales Prediction Framework”, submitted by M.Bhavya Lakshmi Siridevi (19A81A05M5),
K.Chaitanya (19A81A05L2), and K.V.N.S.S.Krishna Sreekar (19A81A05K7) for the award of the
degree of Bachelor of Technology in the Department of Computer Science and Engineering
during the academic year 2022-2023, is a bonafide record of the project work carried out by them.
External Examiner
DECLARATION
We hereby declare that the project report entitled “Machine Learning Application for Black
Friday Sales Prediction Framework”, submitted by us to Sri Vasavi Engineering
College (Autonomous), Tadepalligudem, affiliated to JNTUK Kakinada, in partial fulfillment of
the requirement for the award of the degree of B.Tech in Computer Science and Engineering, is a
record of bonafide project work carried out by us under the guidance of Mr. K. Lakshmaji,
Assistant Professor. We further declare that the work reported in this project has not been
submitted, and will not be submitted, either in part or in full, for the award of any other degree in
this institute or any other institute or university.
Project Associates
M.Bhavya Lakshmi Siridevi(19A81A05M5)
K.Chaitanya(19A81A05L2)
K.V.N.S.S.Krishna Sreekar(19A81A05K7)
ACKNOWLEDGEMENT
First and foremost, we sincerely thank our esteemed institute SRI VASAVI ENGINEERING
COLLEGE for giving us this golden opportunity to fulfill our dream of becoming
engineers.
Our sincere gratitude to our project guide Mr. K. Lakshmaji, Assistant Professor, Department of
Computer Science and Engineering, for his timely cooperation and valuable suggestions while
carrying out this project.
We express our sincere thanks and heartfelt gratitude to Dr. D. Jaya Kumari, Professor & Head
of the Department of Computer Science and Engineering, for permitting us to do our project.
Our special thanks to the management and all the teaching and non-teaching staff members
of the Department of Computer Science and Engineering for their support and cooperation in various
ways during our project work. It is our pleasure to acknowledge the help of all those respected
individuals.
We would like to express our gratitude to our parents and friends who helped us complete this
project.
Project Associates
M.Bhavya Lakshmi Siridevi(19A81A05M5)
K.Chaitanya(19A81A05L2)
K.V.N.S.S.Krishna Sreekar(19A81A05K7)
TABLE OF CONTENTS
S.NO TITLE PAGE NO
ABSTRACT i
1 INTRODUCTION 1
1.1 Introduction 1
1.2 Scope 1
1.3 Objective 1
2 LITERATURE SURVEY 2
CHAPTER 1
INTRODUCTION
1.1 Introduction
“Black Friday” is the name given to the shopping day after Thanksgiving. The day earned this
name because the crowds of shoppers caused auto collisions and sometimes even violence;
police coined the term to describe the chaos surrounding the congestion of pedestrian and
vehicle traffic in downtown shopping areas.
In the retail industry, sales volume plays an important part in deciding the profit or loss of a
company, so predicting sales accurately enables efficient industry management. Black
Friday is like a carnival sale in the USA: on this day, heavily demanded products are sold in
huge volumes at very low prices. To estimate the sales, a prediction model is built that focuses on
the types of products sold in the largest numbers. A customer’s behavior is analyzed
in order to predict the amount he or she will purchase on a particular day. In this
report, we predict the sales of a company on Black Friday. To predict the sales of different
products from their independent variables, we need to analyze the relationships between the
variables and organize the data well, so that a model can perform calculations and
predicts sales accurately.
1.2 Scope
1.3 Objective
Analyzing the data of all the customers and finding the relationship of the independent
variables with respect to the target variable.
Predicting the expected sales by training and testing a model.
CHAPTER-2
LITERATURE SURVEY
CHAPTER-3
Problem Statement
A retail company ABC wants to understand customer purchase behavior (specifically, the
purchase amount) against various products of different categories. They have shared a
purchase summary of various customers for selected high-volume products from last month.
The dataset also contains customer demographics (age, gender, marital status, city_type,
stay_in_current_city), product details (product_id and product category), and the total
purchase_amount from last month.
Existing System
In the existing system they used the Machine Learning Algorithms like Linear regression and
Ridge regression in machine learning.
Time Consuming.
Less Efficient.
High Error rate.
Using Linear regression algorithm it shows the MAE value is 86.1
Using Ridge regression algorithm it shows the MAE value is 86
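For context, a baseline of this kind can be reproduced with scikit-learn. The sketch below is illustrative only: the features and target are synthetic placeholders, not the actual Black Friday data, and the MAE figures quoted above come from the report, not from this code.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                       # placeholder features
y = X @ rng.normal(size=5) + rng.normal(size=500)   # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
for model in (LinearRegression(), Ridge(alpha=1.0)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(type(model).__name__, "MAE:", mean_absolute_error(y_te, pred))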
Proposed System
The proposed system uses the Random Forest and K-Nearest Neighbour machine learning
algorithms. The system predicts sales with high accuracy.
Advantages of Proposed System
More accurate.
Easily identifies trends and patterns.
Quick and efficient in handling data.
It can be more effective if the training data is large.
Functional Requirements
Non-Functional Requirements
Security
Reliability
The system is more reliable because of the qualities inherited from the chosen
platform, Python. Code built using Python is more reliable.
Performance
Maintainability
The maintenance group should be able to fix any problem that occurs suddenly.
System Requirements
Hardware Requirements
System : Computer
Processor : Intel i3 or above
RAM : 8 GB
Software Requirements
CHAPTER-4
SYSTEM DESIGN
The system is a Machine Learning Application for the Black Friday sales prediction framework.
Usually, such systems are designed to use this learned knowledge to better process similar input
in the future. An ML algorithm is one that can learn from experience (observed examples)
with respect to some class of tasks and a performance measure. Classification, which is
additionally mentioned as a pattern recognition technique, is a crucial task by which machines
“learn” to automatically recognize complex patterns, to differentiate between exemplars based on
their different patterns, and to form intelligent decisions that tend to give the proper output with
maximum accuracy. This design approach includes several steps in its implementation;
they are:
1. Data Sets:
The model is fully trained by giving the complete data as input, on which supervised learning can
be done. This dataset contains 8523 observations and 12 features such as Product ID, User
ID, Age, Gender, Occupation, Stay_In_Current_City_Years, Marital_Status,
Product_Category_1, Product_Category_2, Product_Category_3, and Purchase.
The dataset is split into a training set and a testing set; generally an 80:20 ratio is applied to
split them. The data model created using the Random Forest and K-NN algorithms is applied
to the training set, and based on the resulting accuracy, predictions are made on the test set.
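A minimal sketch of the 80:20 split described above, assuming the features are held in X and the Purchase target in y, as they are in the implementation chapter:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the remaining 80% is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)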
2. Data Preprocessing:
The data must be pre-processed before applying any machine learning (ML) algorithm to
our dataset. It is also necessary to convert the data into a form from which an ML
algorithm can predict the value of the purchase variable, given customer information as
input. Pre-processing:
It handles the missing or NaN values.
It can ignore a particular field if it is not useful for data analysis/prediction.
It can replace a categorical value with some numerical value.
For the prediction analysis, it is necessary to transform the data in the
dataset whenever certain data is not acceptable for prediction. Here, the Age column,
which holds values as ranges, has to be transformed.
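For example, the Age column holds range labels such as '0-17' and '26-35'. One common way to make them numeric is a label encoding, sketched below; the sample values are illustrative, not the real data.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of the Age column's range labels
ages = pd.Series(['0-17', '26-35', '26-35', '55+', '18-25'])
print(LabelEncoder().fit_transform(ages))   # each age range becomes an integer code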
3. Data Cleaning:
In a dataset, there may be missing values that need to be dealt with: they must either be
filled with null values or mean values, or dropped from the dataset, because missing
values can create discrepancies in the result.
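A minimal sketch of the two options just described, filling with the column mean versus dropping the rows; the toy frame and column name are placeholders:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Product_Category_2': [2.0, np.nan, 5.0, np.nan]})  # toy frame

# Option 1: fill missing values with the column mean
filled = df['Product_Category_2'].fillna(df['Product_Category_2'].mean())

# Option 2: drop the rows containing missing values instead
dropped = df.dropna()
print(filled.tolist(), len(dropped))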
4. Building the regression model using machine learning algorithms:
For predicting the sales, the Random Forest algorithm and the K-NN algorithm are used.
Random Forest is effective because it provides better results in regression problems. It is
extremely intuitive, easy to implement, and provides interpretable predictions. It produces an
out-of-bag estimated error, which was proven to be unbiased in many tests. It is relatively easy
to tune. It gives the highest accuracy and the lowest error rate for the problem.
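The out-of-bag error mentioned above can be requested directly from scikit-learn's Random Forest. The sketch below uses toy data; for a regressor, oob_score_ reports the R2 on the out-of-bag samples.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                 # toy features
y = 2 * X[:, 0] + rng.normal(size=300)        # toy target

# oob_score=True scores each tree on the bootstrap rows it never saw,
# giving the built-in "out-of-bag" validation estimate mentioned above
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)
print("Out-of-bag R2 estimate:", rf.oob_score_)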
5. Data Prediction:
Prediction refers to the output of an algorithm after it has been trained on a historical dataset
and applied to new data, when forecasting the likelihood of a particular outcome.
4.3 Internal Architecture:
The Unified Modelling Language allows the software engineer to express an analysis model
using a behavioural notation that is governed by a set of syntactic, semantic and pragmatic rules.
A UML system is represented using five different views that describe the system from
distinctly different perspectives. Each view is defined by a set of diagrams, which is as follows.
The use case diagram is dynamic in nature; there should be some internal or external factors for
making the interactions. These internal and external agents are known as actors. Use case
diagrams consist of actors, use cases and their relationships. The diagram is used to model the
system/subsystem of an application. A single use case diagram captures a particular functionality
of a system. Hence, to model the entire system, a number of use case diagrams are used.
Purpose of Usecase Diagrams
Use case diagrams are used to gather the requirements of a system, including internal and external
influences. These requirements are mostly design requirements. Hence, when a
system is analyzed to gather its functionalities, use cases are prepared and actors are identified. When
the initial task is complete, use case diagrams are modeled to present the outside view. In brief,
the purposes of use case diagrams can be said to be as follows:
4.4.2 Sequence Diagram
The sequence diagram represents the flow of messages in the system and is also termed an
event diagram. It helps in envisioning several dynamic scenarios. It portrays the communication
between any two lifelines as a time-ordered sequence of events, such that these lifelines took part
at run time. In UML, a lifeline is represented by a vertical bar, whereas the message flow is
represented by a vertical dotted line that extends across the bottom of the page. It incorporates
iterations as well as branching.
1. To model high-level interaction among active objects within a system.
2. To model interaction among objects inside a collaboration realizing a use case.
3. It either models generic interactions or some certain instances of interaction.
CHAPTER-5
TECHNOLOGIES
5.1 About Python
Python is currently the most widely used multi-purpose, high-level programming
language. Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as for use
as a scripting or glue language to connect existing components together. Python's simple,
easy-to-learn syntax emphasizes readability and therefore reduces the cost of program
maintenance. Python supports modules and packages, which encourages program modularity
and code reuse. The Python interpreter and the extensive standard library are available in source
or binary form without charge for all major platforms, and can be freely distributed. The biggest
strength of Python is its huge collection of standard libraries, which can be used for the following:
Machine Learning
GUI Applications (like Kivy, Tkinter, PyQt etc.)
Web frameworks like Django (used by YouTube, Instagram, Dropbox)
Image processing (like OpenCV, Pillow)
Web scraping (like Scrapy, BeautifulSoup, Selenium)
Test frameworks
Multimedia
Advantages of python:
Extensive Libraries:
Python ships with an extensive library containing code for various purposes like regular
expressions, documentation generation, unit testing, web browsers, threading, databases, CGI,
email, image manipulation, and more. So, we don't have to write the complete code for that
manually.
Extensible:
As we have seen earlier, Python can be extended to other languages. You can write some of your
code in languages like C++ or C. This comes in handy, especially in projects.
Embeddable:
Complementary to extensibility, Python is embeddable as well. You can put your Python code in
the source code of a different language, like C++. This lets us add scripting capabilities to our
code in the other language.
Improved Productivity:
The language's simplicity and extensive libraries render programmers more productive than
languages like Java and C++ do. Also, you need to write less to get more things done.
Readable:
Because it is not such a verbose language, reading Python is much like reading English. This is
the reason why it is so easy to learn, understand, and code. It also does not need curly braces to
define blocks, and indentation is mandatory. This further aids the readability of the code.
Object-Oriented:
This language supports both the procedural and object-oriented programming paradigms. While
functions help us with code reusability, classes and objects let us model the real world. A
class allows the encapsulation of data and functions into one.
Free and open source:
Like we said earlier, Python is freely available. Not only can you download Python for free,
but you can also download its source code, make changes to it, and even distribute it. It
ships with an extensive collection of libraries to help you with your tasks.
Portable:
When you code your project in a language like C++, you may need to make some changes to it if
you want to run it on another platform. But it isn't the same with Python. Here, you need to code
only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent features.
Interpreted:
Lastly, we will say that it is an interpreted language. Since statements are executed one
by one, debugging is easier than in compiled languages.
Disadvantages of Python:
Speed Limitations:
We have seen that Python code is executed line by line. But since Python is interpreted, it often
results in slow execution. This, however, isn't a problem unless speed is a focal point for the
project. In other words, unless high speed is a requirement, the benefits offered by Python are
enough to distract us from its speed limitations.
Design Restrictions:
As you know, Python is dynamically typed. This means that you don't need to declare the type
of a variable while writing the code. It uses duck typing. But wait, what's that? Well, it just means
that if it looks like a duck, it must be a duck. While this is easy on the programmers during
coding, it can raise runtime errors.
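A tiny illustration of the duck typing just described: the function below accepts any object that "quacks", and passing one that does not fails only at runtime, never at compile time. The class and function names are purely illustrative.
class Duck:
    def quack(self):
        return "Quack!"

class Dog:
    def speak(self):
        return "Woof!"

def make_it_quack(animal):
    # No declared types: anything with a quack() method is accepted
    return animal.quack()

print(make_it_quack(Duck()))    # works
# make_it_quack(Dog())          # AttributeError, but only when executed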
1. NumPy
2. Pandas
3. Matplotlib
4. Seaborn
5. Scikit learn
1. NumPy:
NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. It also has functions for working
in the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in
2005 by Travis Oliphant. It is an open-source project and you can use it freely. It is the
fundamental package for scientific computing with Python.
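A short, self-contained example of the array operations NumPy provides:
import numpy as np

a = np.array([[1, 2], [3, 4]])          # a 2-D array (matrix)
print(a * 2)                            # element-wise arithmetic
print(a.T @ a)                          # matrix multiplication
print(np.mean(a), np.linalg.det(a))     # aggregate and linear-algebra functions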
2. Pandas:
Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released under the
three-clause BSD license. pandas is a fast, powerful, flexible and easy-to-use open-source
data analysis and manipulation tool, built on top of the Python programming language.
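A minimal example of the tabular data structures and group-wise operations pandas offers; the tiny table mirrors the kind of columns used in this project.
import pandas as pd

# A small table of the kind used throughout this project
df = pd.DataFrame({'Gender': ['M', 'F', 'M'], 'Purchase': [9800, 7500, 12000]})
print(df.describe())                             # summary statistics
print(df.groupby('Gender')['Purchase'].mean())   # aggregation by group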
4. Seaborn:
Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and
closely integrated with Pandas data structures. Among other things, Seaborn offers a
dataset-oriented API for examining relationships between variables, specialized support for
visualizing categorical data, tools for visualizing univariate and bivariate distributions, and
concise control over matplotlib figure styling.
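A minimal sketch of the kind of Seaborn call used later in the implementation (a bar plot of mean purchase per gender), here on a toy frame:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'F'],
                   'Purchase': [9800, 7500, 12000, 8300]})
sns.barplot(x='Gender', y='Purchase', data=df)   # mean Purchase per Gender
plt.show()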
5. Scikit-Learn
Scikit-learn is a free software machine learning library for the Python programming language. It
features various classification, regression and clustering algorithms, including support
vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.
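All scikit-learn estimators share the same fit/predict interface, which is why the algorithms named above are interchangeable in practice. A minimal sketch on toy data:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

X = np.arange(20).reshape(-1, 1)     # toy feature column
y = 1.5 * X.ravel() + 2              # toy target

for model in (RandomForestRegressor(random_state=0), KNeighborsRegressor(n_neighbors=3)):
    model.fit(X, y)                  # every estimator shares fit()/predict()
    print(type(model).__name__, model.predict([[10]]))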
The Jupyter Notebook is an open-source web application that you can use to create and
share documents that contain live code, equations, visualizations, and text. Jupyter
Notebook is maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off project from the IPython project, which used to have an
IPython Notebook project itself. The name Jupyter comes from the core
programming languages that it supports: Julia, Python, and R. Jupyter ships with the
IPython kernel, which allows you to write your programs in Python, but there are
currently over 100 other kernels that you can also use.
You can also explore and run machine learning code with Google Colab Notebooks, a
cloud computational environment that enables reproducible and collaborative analysis.
One of the advantages of using notebooks as your data science workbench is that you can
easily add data sources from thousands of publicly available datasets or even upload
your own. You can also use output files from another notebook as a data source.
CHAPTER 6
IMPLEMENTATION
7) Once the model is trained, input the test data and predict the output labels.
6.2 Algorithms
A Random Forest is an ensemble method that can perform both regression
and classification tasks by using multiple decision trees and the Bootstrap Aggregation
technique, generally known as bagging. The fundamental idea is to combine
multiple decision trees in deciding the final output instead of depending on any
individual decision tree. The Random Forest (RF) model is an additive model that predicts
the sales by combining decisions from a sequence of base models.
Different types of models have different advantages. The random forest model is good
at handling tabular data with categorical features, or numerical features with relatively few
distinct values. In contrast to linear models, random forests can capture non-linear
interactions between the features and the target. The trees in a Random Forest run in
parallel; there is no interaction between the trees while they are being built.
Step 1: Select random (bootstrap) samples from the given dataset.
Step 2: Construct a decision tree for each sample and obtain a prediction result from each tree.
Step 3: Collect the prediction of every tree (a vote for each predicted result).
Step 4: Select the majority-voted prediction result (the average, for regression) as the final prediction result.
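The steps above can be made concrete with a hand-rolled bagging sketch: several trees, each fit on a bootstrap sample, whose predictions are combined (averaged here, since this is regression). The data is synthetic and the sketch is illustrative, not the project's actual model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.2, size=200)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))                  # Step 1: bootstrap sample
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))   # Step 2: fit one tree per sample

# Steps 3-4: collect every tree's prediction and combine them (averaged for regression)
prediction = np.mean([t.predict([[5.0]])[0] for t in trees])
print("Bagged prediction at x = 5:", prediction)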
Advantages:-
K-Nearest Neighbour Algorithm
Advantages :
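The source leaves this section brief; as a minimal illustration of the algorithm, K-NN regression predicts by averaging the targets of the k closest training points. The toy data below makes the averaging visible.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([10, 12, 11, 50, 52, 51])

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
# 2.5 is closest to the points 1, 2 and 3, so the prediction is mean(10, 12, 11)
print(knn.predict([[2.5]]))   # -> [11.]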
6.3 Code
1. Import the libraries and load the dataset
First, we are going to import all the modules that we are going to need for training our model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder  # needed for the label encoding below
import sklearn.metrics as metrics
from sklearn.metrics import r2_score, roc_auc_score, classification_report, mean_squared_error, accuracy_score, confusion_matrix
Loading Dataset
train = pd.read_csv("/content/drive/MyDrive/sdrive/siri12/train.csv")
test = pd.read_csv("/content/drive/MyDrive/sdrive/siri12/test.csv")
train.head()
train_cpy=train.copy()   # keep untouched copies for reference
test_cpy=test.copy()
train.shape
train.info()
train.Product_ID.nunique()   # number of distinct products
train.User_ID.nunique()      # number of distinct users
train_cat=train.select_dtypes(include='object')   # categorical (object) columns
train_cat.drop(['Product_ID'],axis=1,inplace=True)
train_cat.columns
# Bar plot of each categorical feature
for i in train_cat.columns:
    train[i].value_counts().plot.bar()
    plt.title('{0}'.format(i))
    plt.show()
# Separate the numeric columns and plot their distributions
train_numeric=train.select_dtypes(include=['int64','float64'])
train_numeric.drop(['User_ID'],axis=1,inplace=True)
train_numeric.columns
for i in train_numeric.columns:
    plt.hist(train[i])
    plt.title('{0}'.format(i))
    plt.show()
train_numeric.corr()
sns.barplot(x='Gender', y='Purchase', data=train)
plt.show()
sns.barplot(x='Age', y='Purchase', data=train)
plt.show()
sns.barplot(x='City_Category', y='Purchase', data=train)
plt.show()
sns.barplot(x='Stay_In_Current_City_Years', y='Purchase', data=train)
plt.show()
sns.barplot(x='Marital_Status', y='Purchase', data=train)
plt.show()
train["Product_Category_1_Count"] = train.groupby(['Product_Category_1'])
['Product_Category_1'].transform('count')
pc1_count_dict = train.groupby(['Product_Category_1']).size().to_dict()
test['Product_Category_1_Count'] = test['Product_Category_1'].apply(lambda x:pc1_count_dict.
get(x,0))
train["Product_Category_2_Count"] = train.groupby(['Product_Category_2'])
['Product_Category_2'].transform('count')
pc2_count_dict = train.groupby(['Product_Category_2']).size().to_dict()
test['Product_Category_2_Count'] = test['Product_Category_2'].apply(lambda x:pc2_count_dict.
get(x,0))
train["Product_Category_3_Count"] = train.groupby(['Product_Category_3'])
['Product_Category_3'].transform('count')
pc3_count_dict = train.groupby(['Product_Category_3']).size().to_dict()
test['Product_Category_3_Count'] = test['Product_Category_3'].apply(lambda x:pc3_count_dict.
get(x,0))
train["User_ID_Count"] = train.groupby(['User_ID'])['User_ID'].transform('count')
userID_count_dict = train.groupby(['User_ID']).size().to_dict()
test['User_ID_Count'] = test['User_ID'].apply(lambda x:userID_count_dict.get(x,0))
train["Product_ID_Count"] = train.groupby(['Product_ID'])['Product_ID'].transform('count')
productID_count_dict = train.groupby(['Product_ID']).size().to_dict()
test['Product_ID_Count'] = test['Product_ID'].apply(lambda x:productID_count_dict.get(x,0))
train["User_ID_MinPrice"] = train.groupby(['User_ID'])['Purchase'].transform('min')
userID_min_dict = train.groupby(['User_ID'])['Purchase'].min().to_dict()
test['User_ID_MinPrice'] = test['User_ID'].apply(lambda x:userID_min_dict.get(x,0))
train["User_ID_MaxPrice"] = train.groupby(['User_ID'])['Purchase'].transform('max')
userID_max_dict = train.groupby(['User_ID'])['Purchase'].max().to_dict()
test['User_ID_MaxPrice'] = test['User_ID'].apply(lambda x:userID_max_dict.get(x,0))
train["User_ID_MeanPrice"] = train.groupby(['User_ID'])['Purchase'].transform('mean')
userID_mean_dict = train.groupby(['User_ID'])['Purchase'].mean().to_dict()
test['User_ID_MeanPrice'] = test['User_ID'].apply(lambda x:userID_mean_dict.get(x,0))
train["Product_ID_MinPrice"] = train.groupby(['Product_ID'])['Purchase'].transform('min')
productID_min_dict = train.groupby(['Product_ID'])['Purchase'].min().to_dict()
test['Product_ID_MinPrice'] = test['Product_ID'].apply(lambda x:productID_min_dict.get(x,0))
train["Product_ID_MaxPrice"] = train.groupby(['Product_ID'])['Purchase'].transform('max')
productID_max_dict = train.groupby(['Product_ID'])['Purchase'].max().to_dict()
test['Product_ID_MaxPrice'] = test['Product_ID'].apply(lambda x:productID_max_dict.get(x,0))
train["Product_ID_MeanPrice"] = train.groupby(['Product_ID'])['Purchase'].transform('mean')
productID_mean_dict = train.groupby(['Product_ID'])['Purchase'].mean().to_dict()
test['Product_ID_MeanPrice'] = test['Product_ID'].apply(lambda x:productID_mean_dict.get(x,0)
)
# 25th / 75th percentile of Purchase per user and per product
userID_25p_dict = train.groupby(['User_ID'])['Purchase'].apply(lambda x: np.percentile(x, 25)).to_dict()
train['User_ID_25PercPrice'] = train['User_ID'].apply(lambda x: userID_25p_dict.get(x, 0))
test['User_ID_25PercPrice'] = test['User_ID'].apply(lambda x: userID_25p_dict.get(x, 0))
userID_75p_dict = train.groupby(['User_ID'])['Purchase'].apply(lambda x: np.percentile(x, 75)).to_dict()
train['User_ID_75PercPrice'] = train['User_ID'].apply(lambda x: userID_75p_dict.get(x, 0))
test['User_ID_75PercPrice'] = test['User_ID'].apply(lambda x: userID_75p_dict.get(x, 0))
productID_25p_dict = train.groupby(['Product_ID'])['Purchase'].apply(lambda x: np.percentile(x, 25)).to_dict()
train['Product_ID_25PercPrice'] = train['Product_ID'].apply(lambda x: productID_25p_dict.get(x, 0))
test['Product_ID_25PercPrice'] = test['Product_ID'].apply(lambda x: productID_25p_dict.get(x, 0))
productID_75p_dict = train.groupby(['Product_ID'])['Purchase'].apply(lambda x: np.percentile(x, 75)).to_dict()
train['Product_ID_75PercPrice'] = train['Product_ID'].apply(lambda x: productID_75p_dict.get(x, 0))
test['Product_ID_75PercPrice'] = test['Product_ID'].apply(lambda x: productID_75p_dict.get(x, 0))
round((train.isnull().sum()/len(train.index))*100,2)
round((test.isnull().sum()/len(test.index))*100,2)
train.info()
# Label-encode the remaining categorical columns (LabelEncoder imported above)
le = LabelEncoder()
train['Age']=le.fit_transform(train['Age'])
test['Age']=le.fit_transform(test['Age'])
train['City_Category']=le.fit_transform(train['City_Category'])
test['City_Category']=le.fit_transform(test['City_Category'])
train['Stay_In_Current_City_Years']=le.fit_transform(train['Stay_In_Current_City_Years'])
test['Stay_In_Current_City_Years']=le.fit_transform(test['Stay_In_Current_City_Years'])
pd.set_option('display.max_columns', 100)
train.head(10)
train['Gender']=train['Gender'].map({'M':1, 'F':0})
test['Gender']=test['Gender'].map({'M':1, 'F':0})
train.head()
# Missing product categories (and their count features) are filled with 0
train['Product_Category_2']=train['Product_Category_2'].fillna(0)
test['Product_Category_2']=test['Product_Category_2'].fillna(0)
train['Product_Category_3']=train['Product_Category_3'].fillna(0)
test['Product_Category_3']=test['Product_Category_3'].fillna(0)
train['Product_Category_2_Count']=train['Product_Category_2_Count'].fillna(0)
test['Product_Category_2_Count']=test['Product_Category_2_Count'].fillna(0)
train['Product_Category_3_Count']=train['Product_Category_3_Count'].fillna(0)
test['Product_Category_3_Count']=test['Product_Category_3_Count'].fillna(0)
round((test.isnull().sum()/len(test.index))*100,2)
train=train.drop(['User_ID','Product_ID'],axis=1)
test=test.drop(['User_ID','Product_ID'],axis=1)
train.head()
# Remove Purchase outliers using the IQR rule
q1 = train['Purchase'].quantile(0.25)
q3 = train['Purchase'].quantile(0.75)
iqr = q3-q1  # interquartile range
fence_low = q1-1.5*iqr
fence_high = q3+1.5*iqr
train = train[(train['Purchase'] > fence_low) & (train['Purchase'] < fence_high)]
X=train.drop('Purchase',axis=1)
y=train['Purchase']
# XGBoost-style hyperparameters (defined here, but not used by the models below)
params = {}
params["eta"] = 0.03
params["min_child_weight"] = 10
params["subsample"] = 0.8
params["colsample_bytree"] = 0.7
params["max_depth"] = 10
params["seed"] = 0
plst = list(params.items())
num_rounds = 1100
alist = ['Gender',
         'Age',
         'Occupation',
         'City_Category',
         'Stay_In_Current_City_Years',
         'Marital_Status',
         'Product_Category_1',
         'Product_Category_2',
         'Product_Category_3',
         'User_ID_Count',
         'Product_ID_Count']
blist = ['User_ID_MinPrice',
'User_ID_MaxPrice',
'User_ID_MeanPrice',
'Product_ID_MinPrice',
'Product_ID_MaxPrice',
'Product_ID_MeanPrice']
clist = ['User_ID_25PercPrice',
'User_ID_75PercPrice',
'Product_ID_25PercPrice',
'Product_ID_75PercPrice',
'Product_Category_1_Count',
'Product_Category_2_Count',
'Product_Category_3_Count',]
train1 = train[alist+blist]
test1 = test[alist+blist]
train2 = train[alist+clist]
test2 = test[alist+clist]
X_train,X_test,Y_train,Y_test = train_test_split(train1,y,test_size=0.2,random_state=42)
3. Train and Test the model
# Fit the two proposed models on the training split
from sklearn.ensemble import RandomForestRegressor
RFregressor = RandomForestRegressor()
RFregressor.fit(X_train,Y_train)
Y_pred_rf_reg = RFregressor.predict(X_test)
np.round_(Y_pred_rf_reg)

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
knn.fit(X_train,Y_train)
Y_pred_knn = knn.predict(X_test)
4. Evaluating the model
from sklearn.metrics import mean_squared_error, r2_score
print("Random forest Regression: ")
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred_rf_reg)))
print("R2 score:", r2_score(Y_test, Y_pred_rf_reg))
print("Knn Regression: ")
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred_knn)))
print("R2 score:", r2_score(Y_test, Y_pred_knn))
Random_Forest_Tree=pd.DataFrame({'y_test':Y_test,'prediction':Y_pred_rf_reg})
Random_Forest_Tree.to_csv("Random Forest Tree.csv")
CHAPTER-7
TESTING
Testing is a process which reveals errors in the program. It is the major quality measure
employed during software development. During testing, the program is executed with a set of
test cases, and the output of the program for those test cases is evaluated to determine whether
the program is performing as expected.
Unit testing is done on individual modules as they are completed and become executable. It is
confined only to the designer's requirements. Each module can be tested using the following two strategies:
In this strategy, test cases are generated as input conditions that fully execute all functional
requirements for the program. This testing has been used to find errors in the following
categories:
Interface errors
Performance errors
In this strategy, test cases are generated based on the logic of each module by drawing flow graphs of that
module, and logical decisions are tested for all cases. It has been used to generate test cases
of the following kinds:
Execute all loops at their boundaries and within their operational bounds
Integration testing ensures that software and subsystems work together as a whole. It tests the
interfaces of all the modules to make sure that the modules behave properly when integrated
together.
Functional tests provide systematic demonstrations that the functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
This involves in-house testing of the entire system before delivery to the user. Its aim
is to satisfy the user that the system meets all requirements of the client's specifications.
It is a pre-delivery test in which the entire system is tested on real-world
data and usage to find errors.
A test strategy is designed for all the different types of functionality and hardware by determining the
efforts and costs incurred to achieve the objectives of the system. For any project, the test
strategy can be prepared by:
The test objective is the overall goal and achievement of the test execution. Objectives are defined in
such a way that the system is bug-free and ready to be used by the end-users. A test objective can be
defined by identifying the software features that need to be tested and the goal of the test; these
features need to achieve the goal to be noted as successful.
CHAPTER 8
OUTPUTS
Filling The Null Values
Accuracy
Purchase Prediction
CHAPTER 9
Conclusion
With traditional methods not being of much help to business growth in terms of revenue, the use
of machine learning approaches proves to be an important point in shaping a business
plan that takes into consideration the shopping patterns of consumers.
Projecting sales with respect to several factors, including last year's sales, helps
businesses adopt suitable strategies for increasing the sales of goods that are in demand.
The dataset used for the experimentation is the Black Friday Sales dataset from Kaggle.
The models used are the Random Forest Regressor and K-Nearest Neighbour. The evaluation
measure used is the Mean Squared Error (MSE). Based on Table II, the Random Forest Regressor is
best suited for the prediction of sales on the given dataset.
Thus, the proposed model will predict customer purchases on Black Friday and give the
retailer insight into customers' choice of products. This will result in discounts based on
customer-centric choices, thus increasing the benefit to the retailer as well as the customer.
Future Scope
As future research, we can perform hyperparameter tuning and apply different machine
learning algorithms.
In the future, we can make use of stronger gradient-boosting algorithms like LightGBM. More
efficient black-box approaches like ANNs can be incorporated. The prediction model can
be further enhanced by retailers and customers belonging to different nations to suit
one's needs.