
Main Project

This document is a major project report submitted by three students for their Bachelor of Technology degree. It presents a machine learning application for predicting Black Friday sales. The project aims to analyze customer data and purchase behavior to accurately predict expected sales for Black Friday using testing and training of predictive models. The report includes an introduction, literature survey, system analysis and requirements, system design, implementation details, testing procedures, outputs, and a conclusion.

A MAJOR-PROJECT REPORT ON

Machine Learning Application for Black Friday Sales Prediction Framework
Submitted in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY
In

Computer Science and Engineering


By
M.Bhavya Lakshmi Siridevi(19A81A05M5)

K.Chaitanya(19A81A05L2)

K.V.N.S.S.Krishna Sreekar(19A81A05K7)

Under the Esteemed Supervision of

Mr. K.Lakshmaji,M.Tech.,Assistant Professor

Department of Computer Science and Engineering (Accredited by N.B.A)

SRI VASAVI ENGINEERING COLLEGE(Autonomous)

(Affiliated to JNTUK, Kakinada)

Pedatadepalli, Tadepalligudem-534101, A.P 2022-23


SRI VASAVI ENGINEERING COLLEGE (Autonomous)
Department of Computer Science and Engineering

Pedatadepalli, Tadepalligudem

This is to certify that the Project Report entitled “Machine Learning Application for Black Friday
Sales Prediction Framework” was submitted by M.Bhavya Lakshmi Siridevi (19A81A05M5),
K.Chaitanya (19A81A05L2) and K.V.N.S.S.Krishna Sreekar (19A81A05K7) for the award of the
degree of Bachelor of Technology in the Department of Computer Science and Engineering
during the academic year 2022-2023.

Name of the Project Guide: Mr. K.Lakshmaji, M.Tech., Assistant Professor

Head of the Department: Dr. D.Jaya Kumari, M.Tech., Ph.D., Professor & HOD

External Examiner
DECLARATION
We hereby declare that the project report entitled “Machine Learning Application for Black
Friday Sales Prediction Framework”, submitted by us to Sri Vasavi Engineering
College (Autonomous), Tadepalligudem, affiliated to JNTUK Kakinada, in partial fulfillment of
the requirement for the award of the degree of B.Tech in Computer Science and Engineering, is a
record of bonafide project work carried out by us under the guidance of Mr. K.Lakshmaji,
Assistant Professor. We further declare that the work reported in this project has not been
submitted and will not be submitted, either in part or in full, for the award of any other degree in
this institute or any other institute or University.

Project Associates
M.Bhavya Lakshmi Siridevi(19A81A05M5)

K.Chaitanya(19A81A05L2)

K.V.N.S.S.Krishna Sreekar(19A81A05K7)
ACKNOWLEDGEMENT
First and foremost, we sincerely salute our esteemed institute, SRI VASAVI ENGINEERING
COLLEGE, for giving us this golden opportunity to fulfill our dream of becoming
engineers.

Our sincere gratitude to our project guide Mr.K.Lakshmaji, Assistant professor, Department of
Computer Science and Engineering, for his timely cooperation and valuable suggestions while
carrying out this project.

We express our sincere thanks and heartfelt gratitude to Dr. D.Jaya Kumari, Professor & Head
of the Department of Computer Science and Engineering, for permitting us to do our project.

We express our sincere thanks and heartfelt gratitude to Dr. G.V.N.S.R.Ratnakara Rao,
Principal, for providing a favourable environment and supporting us during the development of
this project.

Our special thanks to the management and all the teaching and non-teaching staff members,
Department of Computer Science and Engineering, for their support and cooperation in various
ways during our project work. It is our pleasure to acknowledge the help of all those respected
individuals.

We would like to express our gratitude to our parents and friends, who helped us to complete this
project.

Project Associates
M.Bhavya Lakshmi Siridevi(19A81A05M5)

K.Chaitanya(19A81A05L2)

K.V.N.S.S.Krishna Sreekar(19A81A05K7)
TABLE OF CONTENTS
S.NO TITLE

ABSTRACT

1 INTRODUCTION
1.1 Introduction
1.2 Scope
1.3 Objective

2 LITERATURE SURVEY

3 SYSTEM STUDY AND ANALYSIS
3.1 Problem Statement
3.2 Existing System
3.3 Limitations of the Existing System
3.4 Proposed System
3.5 Advantages of Proposed System
3.6 Functional Requirements
3.7 Non-Functional Requirements
3.8 System Requirements
3.8.1 Hardware Requirements
3.8.2 Software Requirements

4 SYSTEM DESIGN
4.1 System Design
4.2 System Architecture
4.3 UML Diagrams

5 TECHNOLOGIES
5.1 About Python
5.2 Required Python Libraries

6 IMPLEMENTATION
6.1 Implementation Steps
6.2 Algorithms
6.3 Code

7 TESTING
7.1 Introduction to Testing
7.2 Types of Testing
7.2.1 Unit Testing
7.2.1.1 Black Box Testing
7.2.1.2 White Box Testing
7.2.2 Integrating Testing
7.2.3 Functional Testing
7.2.4 System Testing
7.3 Test Strategy and Design
7.4 Test Objectives
7.5 Features to be Tested

8 OUTPUTS

9 CONCLUSION AND FURTHER WORK

10 REFERENCES
ABSTRACT

This project studies the purchasing behavior of customers (the purchase amount is the dependent
variable) against different products, using their demographic information as features, most of
which are self-explanatory. The dataset contains null values as well as redundant and
unstructured data. Sales prediction is one of the most common applications of machine learning
in the retail industry. A sales predictor has distinct commercial value to shop owners, as it
helps with their inventory management, financial planning, advertising and marketing. The
process of developing such a model includes preprocessing, modeling, training, testing and
evaluation. Hence, a framework is developed to automate some of these steps and to reduce
their complexity.

CHAPTER 1

INTRODUCTION

1.1 Introduction

“Black Friday” is the name given to the shopping day after Thanksgiving. The day came to be
called “Black Friday” because the large number of shoppers caused traffic accidents and
sometimes even violence. Police coined the phrase to describe the disorder surrounding
the congestion of pedestrian and car traffic in downtown shopping areas.

In the retail industry, sales volume plays an important part in deciding the profit or loss of
a company. Predicting sales accurately supports efficient management. Black Friday is like a
carnival sale in the USA. On this day, products in high demand are sold in large volumes at very
low prices. To forecast these sales, a prediction model is built that focuses on the types of
product sold in the largest numbers. A customer’s behavior is analyzed in order to predict the
amount he or she will spend on a particular day. In this report, we predict the sales of a
company on “Black Friday”. To predict the sales of different products from their independent
variables, we need to analyze the relationships between the variables and organize the data
well, so that a model can perform its calculations and predict sales accurately.

1.2 Scope

The scope of the project includes:

• Giving a better basis for marketing and financial planning.
• Predicting sales accurately based on customer behaviour.
• Increasing profits.

1.3 Objective

• Analyzing the data of all the customers and finding the relationship of the independent
variables with the target variable.
• Predicting the expected sales by training and testing predictive models.

CHAPTER-2

LITERATURE SURVEY

1. Black Friday Sales Prediction Analysis using Machine Learning Techniques,
P Poornima, Jennifer Joyce B: this work on sales forecasting for retail stores
aimed to predict the sales of a retail store using different machine
learning techniques and to determine the algorithm best suited to the problem
statement. Standard regression techniques were implemented as well as boosting
techniques, and the boosting algorithms were found to give better results than the
regular regression algorithms.
2. Sales Prediction Using Machine Learning Algorithms, Purvika Bajaj et al.
performed sales prediction based on a dataset collected from a grocery store. The
algorithms used for experimentation are Linear Regression, XGBoost and
Random Forest. The results are compared using Root Mean Squared Error (RMSE),
Variance Score, and Training and Testing Accuracies. The Random Forest algorithm
outperforms the other algorithms.
3. Prediction of Consumer Purchasing in a Grocery Store Using Machine
Learning Techniques, Y.Zuo, K.Yada, A.Ali: this work starts from the observation that
analysis based on linear models is insufficient to satisfy the
requirements of scholars. Taking this issue into consideration, the paper uses two
machine learning methods, the Bayes classifier algorithm and the support vector machine (SVM)
algorithm, and examines their performance on real-world data. It
has been found that other algorithms such as XGBoost and Random Forest give
better results than regular algorithms.

CHAPTER -3

SYSTEM STUDY AND ANALYSIS

Problem Statement

A retail company ABC wants to understand customer purchase behavior (specifically, the
purchase amount) against various products of different categories. They have shared a
purchase summary of various customers for selected high-volume products from last month.
The data set also contains customer demographics (age, gender, marital status, city_type,
stay_in_current_city), product details (product_id and product category) and the total
purchase_amount from last month.

Existing System

The existing system uses machine learning algorithms such as Linear Regression and
Ridge Regression.
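
As a rough illustration of how such baseline models are scored, the sketch below fits Linear and Ridge regression and computes the mean absolute error (MAE), the average of |actual - predicted| over the test rows. The synthetic data, the coefficients and the `alpha` value are assumptions made for the example; they are not the dataset or settings of the existing system.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 numeric features, linear target plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit both baselines and record their MAE on the held-out rows.
baseline_mae = {}
for name, model in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    baseline_mae[name] = mean_absolute_error(y_test, model.predict(X_test))

print(baseline_mae)
```

A lower MAE means the predictions are closer to the actual values on average, which is the basis for comparing baselines against other models.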

Limitations of the Existing System

• Time consuming.
• Less efficient.
• High error rate.
• With the Linear Regression algorithm, the MAE value is 86.1.
• With the Ridge Regression algorithm, the MAE value is 86.

Proposed System

The proposed system uses the Random Forest and K-Nearest Neighbour machine learning
algorithms. The system predicts sales with high accuracy.

Advantages of Proposed System

• More accurate.
• Easily identifies trends and patterns.
• Quick and efficient in handling data.
• It can be more effective if the training data is large.

Functional Requirements

In software engineering, a functional requirement defines a function of a software system or
its component. A function is described as a set of inputs, the behavior, and outputs.
Functional requirements may be calculations, technical details, data manipulation and
processing, and other specific functionality that define what a system is supposed to
accomplish. Behavioral requirements describing all the cases where the system uses the
functional requirements are captured in use cases. Generally, functional requirements are
expressed in the form “system shall do”. The plan for implementing functional requirements is
detailed in the system design. In requirements engineering, functional requirements specify
particular results of a system. Functional requirements drive the application architecture of
a system. A requirements analyst generates use cases after gathering and validating a set of
functional requirements. The hierarchy of functional requirements is: user/stakeholder request
-> feature -> use case. For this project, the specific functionality is to provide the
information to the user.

• System should be able to load the training/test datasets.
• System should be able to load the essential Python libraries.
• System should be able to predict the sales.
• The predicted sales should be accurate.

Non-Functional Requirements

In systems engineering and requirements engineering, a non-functional requirement is a
requirement that specifies criteria that can be used to judge the operation of a system, rather
than specific behaviors. Non-functional requirements include quantitative constraints, such
as response time or accuracy.

Security

The system should be secure and should protect each person’s privacy.

Reliability

The system is reliable because of the qualities it inherits from the chosen
platform, Python.

Performance

The performance characteristics of the system are outlined here:

• Response time (average, maximum)
• Throughput (frames processed per second)
• Accuracy
• Resource utilization (memory, disk, camera)

Maintainability

The maintenance group should be able to fix any problem that occurs unexpectedly.

System Requirements

Hardware Requirements

System : Computer

Processor : I3 or Above

Hard Disk : 256 GB

RAM : 8 GB

Software Requirements

Operating System : Windows 7 or above

Domain : Machine Learning

Programming Language : Python

CHAPTER-4

SYSTEM DESIGN

4.1 System Design

This chapter describes the design of the Machine Learning Application for Black Friday Sales
prediction framework. Machine learning systems are usually designed to use learned knowledge
to better process similar input in the future. An ML algorithm is one that can learn from
experience (observed examples) with respect to some class of tasks and a performance measure.
Classification, which is also referred to as a pattern recognition technique, is a crucial
task by which machines “learn” to automatically recognize complex patterns, to differentiate
between exemplars based on their different patterns, and to form intelligent decisions that
tend to give the proper output with maximum accuracy. This design approach includes several
steps in its implementation; they are:

1. Data Sets:
The model is trained on the complete input data using supervised learning. The dataset
contains 8523 observations and 12 features like Product ID, User ID, Age, Gender,
Occupation, stay_in_current_city_years, marital_status, Product_category_1,
Product_category_2, Product_category_3, and Purchase. The dataset is split into a
training set and a testing set; generally an 80:20 ratio is applied. The models created
using the Random Forest and K-NN algorithms are fitted on the training set, and their
predictions on the test set are used to measure accuracy.
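
A minimal sketch of the 80:20 split described above, using scikit-learn's `train_test_split`. The ten-row frame and its values are invented for the example and are not taken from the project's dataset; `random_state=42` is likewise an arbitrary choice to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (assumption): a few columns named after the features listed above.
df = pd.DataFrame({
    "Age": ["0-17", "26-35", "26-35", "46-50", "18-25",
            "36-45", "51-55", "55+", "26-35", "18-25"],
    "Occupation": [10, 16, 15, 7, 4, 1, 0, 17, 12, 20],
    "Purchase": [8370, 15200, 1422, 1057, 7969,
                 15227, 19215, 15854, 15686, 7871],
})

X = df.drop(columns=["Purchase"])   # independent variables
y = df["Purchase"]                  # target variable

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```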
2. Data Preprocessing:
The data must be pre-processed before applying any machine learning (ML) algorithm to
our dataset. The data has to be converted into a form from which an ML algorithm can
predict the value of the purchase variable, given customer information as input.
Preprocessing:
• handles the missing or NaN values;
• can ignore a particular field if it is not useful for data analysis/prediction;
• can replace a categorical value with a numerical value.
For the prediction analysis, it is necessary to transform data in the dataset whenever it
is not in a form acceptable for prediction. Here, the Age column, whose values are
ranges, needs to be transformed.
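
The preprocessing operations above could look roughly like the following sketch. The column names follow the dataset description, but the sample rows, the age buckets in `age_order`, and the choice of integer codes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows (assumption) with one identifier, two categorical fields and NaN values.
raw = pd.DataFrame({
    "User_ID": [1000001, 1000002, 1000003, 1000004],
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "Product_Category_2": [np.nan, 6.0, 14.0, np.nan],
})

# Hypothetical ordering of the age ranges, mapped to increasing integer codes.
age_order = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
             "46-50": 4, "51-55": 5, "55+": 6}

prepared = raw.drop(columns=["User_ID"])                       # ignore a field
prepared["Gender"] = prepared["Gender"].map({"F": 0, "M": 1})  # categorical -> numeric
prepared["Age"] = prepared["Age"].map(age_order)               # range label -> code
prepared["Product_Category_2"] = prepared["Product_Category_2"].fillna(0)  # NaN -> 0

print(prepared)
```

Mapping the age ranges to ordered integers keeps the natural ordering of the buckets, which one-hot encoding would discard; either choice is a design decision rather than a requirement.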

3. Data Cleaning:
A dataset may contain missing values that need to be dealt with. The missing
values must either be filled with a substitute, such as the mean value of the column, or the
affected rows must be dropped from the dataset. Missing values can create discrepancies in the
result.
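
The two cleaning options mentioned above, filling with the column mean or dropping the rows, can be sketched with pandas as follows. The four-row frame is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) with two missing values in one column.
frame = pd.DataFrame({
    "Product_Category_2": [2.0, np.nan, 8.0, np.nan],
    "Purchase": [100, 200, 300, 400],
})

# Count the missing values in each column before deciding how to treat them.
missing_per_column = frame.isna().sum()

# Option 1: fill the gaps with the mean of the observed values.
filled = frame.fillna({"Product_Category_2": frame["Product_Category_2"].mean()})

# Option 2: drop every row where the column is missing.
dropped = frame.dropna(subset=["Product_Category_2"])

print(missing_per_column)
print(filled)
print(dropped)
```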
4. Building the regression model using machine learning algorithms:

For predicting the sales, the Random Forest algorithm and the K-NN algorithm are used. Random
Forest is effective because it provides good results on regression problems. It is intuitive,
easy to implement and provides interpretable predictions. It produces an out-of-bag error
estimate, which has proven to be unbiased in many tests. It is relatively easy to tune. It
gives the highest accuracy and the lowest error rate for this problem.
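
As an illustration of the out-of-bag estimate mentioned above, the sketch below fits a `RandomForestRegressor` with `oob_score=True` and a `KNeighborsRegressor` on synthetic data. The data-generating formula and the hyperparameters (`n_estimators=100`, `n_neighbors=5`) are assumptions for the example, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (assumption): a non-linear target that depends on two of three features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(400, 3))
y = 50 * np.sin(X[:, 0]) + 5 * X[:, 1] + rng.normal(scale=2.0, size=400)

# oob_score=True scores each row using only the trees that did not see it in training.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X, y)

# K-NN regressor: predicts the mean target of the 5 nearest training rows.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

print("Out-of-bag R^2:", forest.oob_score_)
print("K-NN predictions:", knn.predict(X[:3]))
```

The out-of-bag score gives a validation-style estimate without setting aside a separate hold-out set, which is why it is often used as a quick check while tuning.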

5. Data Prediction:

Prediction refers to the output of an algorithm after it has been trained on a historical
dataset and applied to new data when forecasting the likelihood of a particular outcome.

4.2 System Architecture:

4.3 Internal Architecture:

4.4 UML Diagrams

The Unified Modelling Language allows the software engineer to express an analysis model
using a modelling notation that is governed by a set of syntactic, semantic and pragmatic rules.
A UML system is represented using five different views that describe the system from
distinctly different perspectives. Each view is defined by a set of diagrams, which are as follows.

4.4.1 Usecase Diagrams

The use case diagram is dynamic in nature; there should be some internal or external factors
for making the interactions. These internal and external agents are known as actors. Use case
diagrams consist of actors, use cases and their relationships. The diagram is used to model the
system/subsystem of an application. A single use case diagram captures a particular functionality
of a system. Hence, to model the entire system, a number of use case diagrams are used.

Purpose of Usecase Diagrams

Use case diagrams are used to gather the requirements of a system, including internal and external
influences. These requirements are mostly design requirements. Hence, when a system is
analyzed to gather its functionalities, use cases are prepared and actors are identified. When
the initial task is complete, use case diagrams are modeled to present the outside view. In brief,
the purposes of use case diagrams are as follows:

• Used to gather the requirements of a system.
• Used to get an outside view of a system.
• Identify the external and internal factors influencing the system.
• Show the interaction among the requirements and actors.

4.4.1.1 Use case Diagram For Black Friday Sales Prediction

4.4.2 Sequence Diagram

The sequence diagram represents the flow of messages in the system and is also termed an
event diagram. It helps in envisioning several dynamic scenarios. It portrays the communication
between any two lifelines as a time-ordered sequence of events, such that these lifelines took
part at run time. In UML, a lifeline is represented by a vertical dashed line running down the
page, whereas the message flow is represented by horizontal arrows between lifelines. It
incorporates iterations as well as branching.

Purpose of a Sequence Diagram

1. To model high-level interaction among active objects within a system.
2. To model interaction among objects inside a collaboration realizing a use case.
3. It models either generic interactions or certain instances of interaction.

4.4.2.1 Sequence Diagram For Black Friday Sales Prediction

CHAPTER-5
TECHNOLOGIES
5.1 About Python
Python is currently one of the most widely used multi-purpose, high-level programming
languages. Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as for use
as a scripting or glue language to connect existing components together. Python's simple,
easy-to-learn syntax emphasizes readability and therefore reduces the cost of program
maintenance. Python supports modules and packages, which encourages program modularity and code
reuse. The Python interpreter and the extensive standard library are available in source or
binary form without charge for all major platforms, and can be freely distributed. The biggest
strength of Python is its huge collection of standard and third-party libraries, which can be
used for the following:

• Machine Learning
• GUI applications (like Kivy, Tkinter, PyQt etc.)
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, BeautifulSoup, Selenium)
• Test frameworks
• Multimedia

Advantages of Python:

Extensive Libraries:
Python ships with an extensive library that contains code for various purposes like regular
expressions, documentation generation, unit testing, web browsers, threading, databases, CGI,
email, image manipulation, and more. So, we don’t have to write the complete code for that
manually.

Extensible:
Python can be extended with other languages. You can write some of your
code in languages like C++ or C. This comes in handy, especially in projects.

Embeddable:
Complementary to extensibility, Python is embeddable as well. You can put your Python code in
the source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.

Improved Productivity:
The language’s simplicity and extensive libraries make programmers more productive than
languages like Java and C++ do. You also need to write less code to get more things done.

Simple and Easy:
When working with Java, you may have to create a class to print ‘Hello World’. But in Python,
just a print statement will do. It is also quite easy to learn, understand, and code. This is why
when people pick up Python, they have a hard time adjusting to other, more verbose languages like
Java.

Readable:
Because it is not such a verbose language, reading Python is much like reading English. This is
the reason why it is so easy to learn, understand, and code. It also does not need curly braces to
define blocks, and indentation is mandatory. This further aids the readability of the code.

Object-Oriented:
This language supports both the procedural and object-oriented programming paradigms. While
functions help us with code reusability, classes and objects let us model the real world. A
class allows the encapsulation of data and functions into one.

Free and open source:
Python is freely available. Not only can you download Python for free,
but you can also download its source code, make changes to it, and even distribute it. It
ships with an extensive collection of libraries to help you with your tasks.

Portable:
When you code your project in a language like C++, you may need to make some changes to it if
you want to run it on another platform. But it isn’t the same with Python. Here, you need to code
only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful not to include any system-dependent features.

Interpreted:
Lastly, Python is an interpreted language. Since statements are executed one
by one, debugging is easier than in compiled languages.

Disadvantages of Python:

Speed Limitations:
We have seen that Python code is executed line by line. But since python is interpreted ,I to ften
results in slow execution .This, however,isn’t a problem unless speed is a focal point for the
project. In otherwords, unless high speed is requirement ,the benefits offered byPython are
enough to distractus from its speed limitations.

Weak in Mobile Computing and Browsers:
While it serves as an excellent server-side language, Python is much more rarely seen on the
client side. Besides that, it is rarely used to implement smartphone-based
applications. One such application is called Carbonnelle.

Design Restrictions:
As you know, Python is dynamically typed. This means that you don’t need to declare the type
of a variable while writing the code. It uses duck typing. But wait, what’s that? Well, it just
means that if it looks like a duck, it must be a duck. While this is easy on the programmers during
coding, it can raise runtime errors.
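
A small sketch of what duck typing means in practice, and of the kind of runtime error it can lead to. The `Duck` and `Robot` classes and the `make_it_quack` function are hypothetical names invented for this example.

```python
class Duck:
    def quack(self):
        return "Quack"


class Robot:
    def beep(self):
        return "Beep"


def make_it_quack(thing):
    # No type is declared or checked: any object with a quack() method is accepted.
    return thing.quack()


result_ok = make_it_quack(Duck())

# The mistake is only discovered when the line actually runs.
try:
    make_it_quack(Robot())
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(result_ok, error_name)
```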

Underdeveloped Database Access Layers:
Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and ODBC (Open
DataBase Connectivity), Python’s database access layers are a bit underdeveloped.
Consequently, it is less often applied in huge enterprises.

Required Python Libraries:

1. NumPy
2. Pandas
3. Matplotlib
4. Seaborn
5. Scikit-learn

1. NumPy:

• NumPy is a library for the Python programming language that adds support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms,
and matrices. NumPy was created in 2005 by Travis Oliphant. It is an open source project
and you can use it freely. It is the fundamental package for scientific computing with
Python.
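
A brief sketch of the array arithmetic NumPy provides. The prices and discount rates are invented numbers used only to show element-wise operations and broadcasting.

```python
import numpy as np

# Toy 2x3 array (assumption): prices of three products in two stores.
prices = np.array([[100.0, 250.0, 75.0],
                   [120.0, 300.0, 60.0]])

# One discount rate per product; broadcasting applies it to every row.
discount = np.array([0.10, 0.20, 0.50])
sale_prices = prices * (1 - discount)

# Sum across the product axis to get one total per store.
row_totals = sale_prices.sum(axis=1)
print(sale_prices)
print(row_totals)
```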

2. Pandas:

• Pandas is a software library written for the Python programming language for data
manipulation and analysis. In particular, it offers data structures and operations for
manipulating numerical tables and time series. It is free software released under the
three-clause BSD license.
• pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language.
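
A short sketch of the kind of table manipulation pandas offers, here a group-wise average. The column names echo the dataset used in this report, but the five rows are made up.

```python
import pandas as pd

# Toy table (assumption): purchases tagged with a city category.
sales = pd.DataFrame({
    "City_Category": ["A", "B", "A", "C", "B"],
    "Purchase": [100, 200, 300, 400, 600],
})

# Average purchase amount per city category.
mean_by_city = sales.groupby("City_Category")["Purchase"].mean()
print(mean_by_city)
```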

4. Seaborn:

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib
and closely integrated with Pandas data structures, and it provides a high-level interface for
drawing attractive and informative statistical graphics. Here is some of the functionality that
Seaborn offers:

• A dataset-oriented API for examining relationships between multiple variables
• Specialized support for using categorical variables to show observations or aggregate
statistics
• Options for visualizing univariate or bivariate distributions and for comparing them
between subsets of data
• Automatic estimation and plotting of linear regression models for different kinds of
dependent variables
• Convenient views onto the overall structure of complex datasets
• High-level abstractions for structuring multi-plot grids that let you easily build complex
visualizations.

5. Scikit-Learn

Scikit-learn is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support
vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to
interoperate with the Python numerical and scientific libraries NumPy and SciPy.

5.2 Jupyter Notebook

• The Jupyter Notebook is an open source web application that you can use to create and
share documents that contain live code, equations, visualizations, and text. Jupyter
Notebook is maintained by the people at Project Jupyter.
• Jupyter Notebooks are a spin-off project from the IPython project, which used to have an
IPython Notebook project itself. The name Jupyter comes from the core programming
languages that it supports: Julia, Python, and R. Jupyter ships with the IPython kernel,
which allows you to write your programs in Python, but there are currently over 100 other
kernels that you can also use.

5.3 Google Colab Notebook

• Google Colab Notebooks let you explore and run machine learning code in a cloud
computational environment that enables reproducible and collaborative analysis.
• One of the advantages of using Notebooks as your data science workbench is that you can
easily add data sources from thousands of publicly available datasets or even upload
your own. You can also use output files from another Notebook as a data source.

CHAPTER 6

IMPLEMENTATION

6.1 Implementation Steps

1) Import the necessary packages into the notebook.

2) Load the dataset needed for the implementation.

3) Preprocess the loaded dataset.

4) Clean and filter the data according to the requirement.

5) Select features on the basis of their relation with the purchase amount.

6) Train the models using the Random Forest and K-NN algorithms.

7) Once the model is trained, input the test data and predict the output values.

8) Calculate the accuracy of the model.
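
The steps above can be sketched end to end as follows. This is a minimal outline under stated assumptions: the data is synthetic (generated in place of loading the real CSV files), the feature list is shortened, and RMSE and R² are used as the evaluation measures since the target is a continuous purchase amount.

```python
# Step 1: imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Step 2: "load" the data (assumption: synthetic rows stand in for the real file)
rng = np.random.default_rng(7)
n = 600
data = pd.DataFrame({
    "Gender": rng.choice(["F", "M"], size=n),
    "Occupation": rng.integers(0, 21, size=n),
    "Product_Category_1": rng.integers(1, 6, size=n),
})
data["Purchase"] = 2000 * data["Product_Category_1"] + rng.normal(0, 300, size=n)

# Steps 3-5: encode the categorical column and choose the feature columns
data["Gender"] = data["Gender"].map({"F": 0, "M": 1})
features = ["Gender", "Occupation", "Product_Category_1"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["Purchase"], test_size=0.2, random_state=7
)

# Steps 6-8: train each model, predict on the test rows, and score the predictions
models = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=7),
    "knn": KNeighborsRegressor(n_neighbors=5),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    scores[name] = {"rmse": rmse, "r2": r2_score(y_test, pred)}

print(scores)
```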

6.2 Algorithms

Random Forest Regressor Algorithm

A Random Forest is an ensemble method that can perform both regression
and classification tasks by using multiple decision trees and the Bootstrap Aggregation
technique, generally known as bagging. The fundamental idea is to combine
multiple decision trees in deciding the final output instead of depending on any
individual decision tree. The Random Forest (RF) model is an additive model that predicts
the sales by combining decisions from a sequence of base models.

Different types of models have different advantages. The random forest model is very good
at handling tabular data with categorical features, or numerical features that take only a
few distinct values. In contrast to linear models, random forests can capture non-linear
interactions between the features and the target. Trees in a Random Forest are built in
parallel; there is no interaction between the trees while they are being built.

The Random Forest algorithm is given below:

Step 1: Select random samples from the data set.

Step 2: Construct a decision tree for each sample and obtain its prediction result.

Step 3: Combine the predicted results, by voting for classification or by averaging for regression.

Step 4: Take the majority vote (classification) or the average (regression) as the final prediction result.
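
The four steps can be written out by hand for the regression case, as in the sketch below. It is a simplified illustration rather than scikit-learn's own `RandomForestRegressor`: the synthetic data, the number of trees (25) and `max_features=1` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the target depends on the first feature only.
rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(200, 2))
y = 10 * X[:, 0] + rng.normal(scale=1.0, size=200)

trees = []
for i in range(25):
    # Step 1: bootstrap sample, drawing row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree per sample; max_features=1 makes each
    # split consider a single randomly chosen feature.
    tree = DecisionTreeRegressor(max_features=1, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Steps 3-4: collect every tree's prediction and average them (regression).
X_new = np.array([[1.0, 2.0], [4.0, 2.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])
forest_prediction = per_tree.mean(axis=0)
print(forest_prediction)
```

For a classification task the last step would take the most frequent class across the trees instead of the mean.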

Advantages:

• It can perform both regression and classification tasks.
• A random forest produces good predictions that can be understood easily.
• It can handle large datasets efficiently.

K-Nearest Neighbour Algorithm

K-Nearest Neighbour is one of the simplest machine learning algorithms based on the
supervised learning technique. The K-NN algorithm assumes similarity between the new
case/data and the available cases, and puts the new case into the category that is most
similar to the available categories. The algorithm stores all the available data and
classifies a new data point based on similarity. This means that when new data appears, it
can be easily classified into a well-suited category by using the K-NN algorithm. K-NN is a
non-parametric algorithm, which means it does not make any assumption about the underlying
data. It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead it stores the dataset and, at the time of classification, performs
an action on the dataset.
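
The description above is phrased for classification; for predicting a numeric value such as a purchase amount, the same idea averages the targets of the nearest neighbours. The sketch below is a hand-written illustration with made-up numbers, and `knn_predict` is a hypothetical helper, not a library function.

```python
import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Average the targets of the k training rows closest to x_new."""
    # "Lazy" learning: nothing is fitted; all work happens at prediction time.
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of k closest rows
    return y_train[nearest].mean()


# Toy data (assumption): one feature, five stored examples.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([100.0, 110.0, 120.0, 500.0, 520.0])

# The three nearest neighbours of 2.5 are 2.0, 3.0 and 1.0.
estimate = knn_predict(X_train, y_train, np.array([2.5]), k=3)
print(estimate)
```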

Advantages:

• It is very easy to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.

6.3 Code
1. Import the libraries and load the dataset
First, we are going to import all the modules that we are going to need for training our model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import (train_test_split, KFold, StratifiedKFold,
                                     GridSearchCV, RandomizedSearchCV, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
import sklearn.metrics as metrics
from sklearn.metrics import (r2_score, roc_auc_score, classification_report,
                             mean_squared_error, accuracy_score, confusion_matrix)

Loading Dataset
train = pd.read_csv("/content/drive/MyDrive/sdrive/siri12/train.csv")
test = pd.read_csv("/content/drive/MyDrive/sdrive/siri12/test.csv")

2. Preprocess the data

# Quick look at the data; keep untouched copies of the raw frames.
train.head()
train_cpy = train.copy()
test_cpy = test.copy()
train.shape
train.info()
train.Product_ID.nunique()
train.User_ID.nunique()

# Bar charts of the categorical columns (Product_ID has too many levels to plot).
train_cat = train.select_dtypes(include='object').drop(['Product_ID'], axis=1)
train_cat.columns
for i in train_cat.columns:
    train[i].value_counts().plot.bar()
    plt.title('{0}'.format(i))
    plt.show()

# Histograms and correlations of the numeric columns (User_ID is an identifier).
train_numeric = train.select_dtypes(include=['int64', 'float64']).drop(['User_ID'], axis=1)
train_numeric.columns
for i in train_numeric.columns:
    plt.hist(train[i].dropna())  # drop missing values before binning
    plt.title('{0}'.format(i))
    plt.show()
train_numeric.corr()

sns.barplot(x='Gender', y='Purchase', data=train)
plt.show()
sns.barplot(x='Age', y='Purchase', data=train)
plt.show()
sns.barplot(x='City_Category', y='Purchase', data=train)
plt.show()
sns.barplot(x='Stay_In_Current_City_Years', y='Purchase', data=train)
plt.show()
sns.barplot(x='Marital_Status', y='Purchase', data=train)
plt.show()
# Frequency features: how often each category, user and product occurs in the
# training set. Values that appear only in the test set get a count of 0.
train["Product_Category_1_Count"] = train.groupby(['Product_Category_1'])['Product_Category_1'].transform('count')
pc1_count_dict = train.groupby(['Product_Category_1']).size().to_dict()
test['Product_Category_1_Count'] = test['Product_Category_1'].apply(lambda x: pc1_count_dict.get(x, 0))

train["Product_Category_2_Count"] = train.groupby(['Product_Category_2'])['Product_Category_2'].transform('count')
pc2_count_dict = train.groupby(['Product_Category_2']).size().to_dict()
test['Product_Category_2_Count'] = test['Product_Category_2'].apply(lambda x: pc2_count_dict.get(x, 0))

train["Product_Category_3_Count"] = train.groupby(['Product_Category_3'])['Product_Category_3'].transform('count')
pc3_count_dict = train.groupby(['Product_Category_3']).size().to_dict()
test['Product_Category_3_Count'] = test['Product_Category_3'].apply(lambda x: pc3_count_dict.get(x, 0))

train["User_ID_Count"] = train.groupby(['User_ID'])['User_ID'].transform('count')
userID_count_dict = train.groupby(['User_ID']).size().to_dict()
test['User_ID_Count'] = test['User_ID'].apply(lambda x: userID_count_dict.get(x, 0))

train["Product_ID_Count"] = train.groupby(['Product_ID'])['Product_ID'].transform('count')
productID_count_dict = train.groupby(['Product_ID']).size().to_dict()
test['Product_ID_Count'] = test['Product_ID'].apply(lambda x: productID_count_dict.get(x, 0))

# Price features: min, max, mean and quartiles of Purchase per user and per product.
# They are computed on the training set and looked up for the test set (0 if unseen).
# Note: they are derived from the target before the train/validation split below,
# so the validation scores may be optimistic.
train["User_ID_MinPrice"] = train.groupby(['User_ID'])['Purchase'].transform('min')
userID_min_dict = train.groupby(['User_ID'])['Purchase'].min().to_dict()
test['User_ID_MinPrice'] = test['User_ID'].apply(lambda x: userID_min_dict.get(x, 0))

train["User_ID_MaxPrice"] = train.groupby(['User_ID'])['Purchase'].transform('max')
userID_max_dict = train.groupby(['User_ID'])['Purchase'].max().to_dict()
test['User_ID_MaxPrice'] = test['User_ID'].apply(lambda x: userID_max_dict.get(x, 0))

train["User_ID_MeanPrice"] = train.groupby(['User_ID'])['Purchase'].transform('mean')
userID_mean_dict = train.groupby(['User_ID'])['Purchase'].mean().to_dict()
test['User_ID_MeanPrice'] = test['User_ID'].apply(lambda x: userID_mean_dict.get(x, 0))

train["Product_ID_MinPrice"] = train.groupby(['Product_ID'])['Purchase'].transform('min')
productID_min_dict = train.groupby(['Product_ID'])['Purchase'].min().to_dict()
test['Product_ID_MinPrice'] = test['Product_ID'].apply(lambda x: productID_min_dict.get(x, 0))

train["Product_ID_MaxPrice"] = train.groupby(['Product_ID'])['Purchase'].transform('max')
productID_max_dict = train.groupby(['Product_ID'])['Purchase'].max().to_dict()
test['Product_ID_MaxPrice'] = test['Product_ID'].apply(lambda x: productID_max_dict.get(x, 0))

train["Product_ID_MeanPrice"] = train.groupby(['Product_ID'])['Purchase'].transform('mean')
productID_mean_dict = train.groupby(['Product_ID'])['Purchase'].mean().to_dict()
test['Product_ID_MeanPrice'] = test['Product_ID'].apply(lambda x: productID_mean_dict.get(x, 0))

userID_25p_dict = train.groupby(['User_ID'])['Purchase'].apply(lambda x: np.percentile(x, 25)).to_dict()
train['User_ID_25PercPrice'] = train['User_ID'].apply(lambda x: userID_25p_dict.get(x, 0))
test['User_ID_25PercPrice'] = test['User_ID'].apply(lambda x: userID_25p_dict.get(x, 0))

userID_75p_dict = train.groupby(['User_ID'])['Purchase'].apply(lambda x: np.percentile(x, 75)).to_dict()
train['User_ID_75PercPrice'] = train['User_ID'].apply(lambda x: userID_75p_dict.get(x, 0))
test['User_ID_75PercPrice'] = test['User_ID'].apply(lambda x: userID_75p_dict.get(x, 0))

productID_25p_dict = train.groupby(['Product_ID'])['Purchase'].apply(lambda x: np.percentile(x, 25)).to_dict()
train['Product_ID_25PercPrice'] = train['Product_ID'].apply(lambda x: productID_25p_dict.get(x, 0))
test['Product_ID_25PercPrice'] = test['Product_ID'].apply(lambda x: productID_25p_dict.get(x, 0))

productID_75p_dict = train.groupby(['Product_ID'])['Purchase'].apply(lambda x: np.percentile(x, 75)).to_dict()
train['Product_ID_75PercPrice'] = train['Product_ID'].apply(lambda x: productID_75p_dict.get(x, 0))
test['Product_ID_75PercPrice'] = test['Product_ID'].apply(lambda x: productID_75p_dict.get(x, 0))
# Percentage of missing values per column.
round((train.isnull().sum()/len(train.index))*100, 2)
round((test.isnull().sum()/len(test.index))*100, 2)
train.info()

# Label-encode the string columns. The encoder is fitted on the training column
# and reused on the test column so that both share the same mapping.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in ['Age', 'City_Category', 'Stay_In_Current_City_Years']:
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])
pd.set_option('display.max_columns', 100)
train.head(10)
train['Gender'] = train['Gender'].map({'M': 1, 'F': 0})
test['Gender'] = test['Gender'].map({'M': 1, 'F': 0})
train.head()

# Missing product categories (and their counts) are filled with 0.
train['Product_Category_2'] = train['Product_Category_2'].fillna(0)
test['Product_Category_2'] = test['Product_Category_2'].fillna(0)
train['Product_Category_3'] = train['Product_Category_3'].fillna(0)
test['Product_Category_3'] = test['Product_Category_3'].fillna(0)
train['Product_Category_2_Count'] = train['Product_Category_2_Count'].fillna(0)
test['Product_Category_2_Count'] = test['Product_Category_2_Count'].fillna(0)
train['Product_Category_3_Count'] = train['Product_Category_3_Count'].fillna(0)
test['Product_Category_3_Count'] = test['Product_Category_3_Count'].fillna(0)
round((test.isnull().sum()/len(test.index))*100, 2)

# The raw identifiers are no longer needed once the features above exist.
train = train.drop(['User_ID', 'Product_ID'], axis=1)
test = test.drop(['User_ID', 'Product_ID'], axis=1)
train.head()

# Remove Purchase outliers that lie outside 1.5 * IQR from the quartiles.
q1 = train['Purchase'].quantile(0.25)
q3 = train['Purchase'].quantile(0.75)
iqr = q3 - q1  # interquartile range
fence_low = q1 - 1.5*iqr
fence_high = q3 + 1.5*iqr
train = train[(train['Purchase'] > fence_low) & (train['Purchase'] < fence_high)]
X = train.drop('Purchase', axis=1)  # axis must be passed by keyword in recent pandas
y = train['Purchase']
# XGBoost-style hyperparameters. They are defined here but are not used by the
# Random Forest and K-NN models trained below.
params = {}
params["eta"] = 0.03
params["min_child_weight"] = 10
params["subsample"] = 0.8
params["colsample_bytree"] = 0.7
params["max_depth"] = 10
params["seed"] = 0
plst = list(params.items())
num_rounds = 1100

# Feature groups: base columns, min/max/mean prices, quartiles and category counts.
alist = ['Gender',
         'Age',
         'Occupation',
         'City_Category',
         'Stay_In_Current_City_Years',
         'Marital_Status',
         'Product_Category_1',
         'Product_Category_2',
         'Product_Category_3',
         'User_ID_Count',
         'Product_ID_Count']
blist = ['User_ID_MinPrice',
         'User_ID_MaxPrice',
         'User_ID_MeanPrice',
         'Product_ID_MinPrice',
         'Product_ID_MaxPrice',
         'Product_ID_MeanPrice']
clist = ['User_ID_25PercPrice',
         'User_ID_75PercPrice',
         'Product_ID_25PercPrice',
         'Product_ID_75PercPrice',
         'Product_Category_1_Count',
         'Product_Category_2_Count',
         'Product_Category_3_Count']
train1 = train[alist+blist]
test1 = test[alist+blist]
train2 = train[alist+clist]
test2 = test[alist+clist]

# Hold out 20% of the training rows for validation.
X_train, X_test, Y_train, Y_test = train_test_split(train1, y, test_size=0.2, random_state=42)
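The preprocessing above repeatedly computes a statistic per group on the training set and then looks it up for the test set, with 0 for users or products that never appear in training. The sketch below shows that pattern with the standard library only, as a stand-in for the pandas groupby calls. The helper names group_stats and add_features and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_stats(rows, key, value):
    """Per-group count, min, max and mean of `value`, computed from training rows only."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"Count": len(v), "MinPrice": min(v), "MaxPrice": max(v), "MeanPrice": mean(v)}
        for k, v in groups.items()
    }

def add_features(rows, stats, key):
    """Attach the group statistics to each row; keys unseen in training fall back to 0."""
    default = {"Count": 0, "MinPrice": 0, "MaxPrice": 0, "MeanPrice": 0}
    return [
        {**row, **{f"{key}_{name}": val for name, val in stats.get(row[key], default).items()}}
        for row in rows
    ]

# Made-up sample rows for illustration.
train_rows = [
    {"User_ID": 1000001, "Purchase": 100},
    {"User_ID": 1000001, "Purchase": 300},
    {"User_ID": 1000002, "Purchase": 50},
]
test_rows = [{"User_ID": 1000001}, {"User_ID": 1000003}]

user_stats = group_stats(train_rows, "User_ID", "Purchase")
test_features = add_features(test_rows, user_stats, "User_ID")
print(test_features)
```

Using 0 as the fallback follows the report's code. A global statistic, such as the overall mean purchase, would be another possible default for unseen users or products.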
3. Train and Test the model
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Random Forest regressor with default hyperparameters.
RFregressor = RandomForestRegressor()
RFregressor.fit(X_train, Y_train)
Y_pred_rf_reg = RFregressor.predict(X_test)
np.round(Y_pred_rf_reg)  # np.round_ was removed in NumPy 2.0

# K-Nearest Neighbours regressor with default hyperparameters (k = 5).
knn = KNeighborsRegressor()
knn.fit(X_train, Y_train)
Y_pred_knn = knn.predict(X_test)

4. Evaluating the model
from sklearn.metrics import mean_squared_error, r2_score

print("Random forest Regression: ")
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred_rf_reg)))
print("R2 score:", r2_score(Y_test, Y_pred_rf_reg))

print("Knn Regression: ")
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred_knn)))
print("R2 score:", r2_score(Y_test, Y_pred_knn))

# Save the held-out targets next to the Random Forest predictions.
Random_Forest_Tree = pd.DataFrame({'y_test': Y_test, 'prediction': Y_pred_rf_reg})
Random_Forest_Tree.to_csv("Random Forest Tree.csv")
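To make the two reported measures concrete, the sketch below computes RMSE and the R2 score by hand on a small made-up example. It assumes the usual definitions (RMSE is the square root of the mean squared error; R2 is one minus the ratio of the residual to the total sum of squares). The function names rmse and r2 are illustrative and are not part of scikit-learn.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up purchase amounts and predictions, each off by exactly 10.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 390]

print("RMSE:", rmse(y_true, y_pred))    # 10.0
print("R2 score:", r2(y_true, y_pred))  # approximately 0.992
```

A lower RMSE and an R2 score closer to 1 both indicate a better fit, which is how the two models are compared.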

CHAPTER 7

TESTING

7.1 INTRODUCTION TO TESTING

Testing is a process, which reveals errors in the program. It is the major quality measure
employed during software development. During testing, the program is executed with a set of
test cases and the output of the program for the test cases is evaluated to determine if the
program is performing as it is expected to perform.

7.2 Types of Testing

7.2.1 Unit Testing

Unit testing is done on individual modules as they are completed and become executable. It is confined
only to the designer's requirements. Each module can be tested using the following two strategies:

7.2.1.1 Black Box Testing

In this strategy, test cases are generated as input conditions that fully exercise all functional
requirements of the program. This testing is used to find errors in the following categories:

• Incorrect or missing functions
• Interface errors
• Errors in data structures or external database access
• Performance errors
• Initialization and termination errors

In this testing only the output is checked for correctness. The logical flow of the data is not
checked.

7.2.1.2 White Box Testing

In this strategy, test cases are generated from the logic of each module by drawing flow graphs of
that module, and logical decisions are tested on all the cases. It is used to generate test cases
that:

• Guarantee that all independent paths have been executed.
• Execute all logical decisions on their true and false sides.
• Execute all loops at their boundaries and within their operational bounds.
• Exercise internal data structures to ensure their validity.
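As an illustration of these criteria, the sketch below applies white-box tests to a small outlier filter modelled on the fence check in Section 6.3. The function filter_outliers and the test values are hypothetical; the tests are chosen so that the loop runs zero, one and many times and the decision is taken on both its true and false sides.

```python
def filter_outliers(values, fence_low, fence_high):
    """Keep only the values strictly between the two fences."""
    kept = []
    for v in values:                     # loop under test
        if fence_low < v < fence_high:   # decision under test
            kept.append(v)
    return kept

def run_white_box_tests():
    # Loop boundaries: zero iterations, one iteration, many iterations.
    assert filter_outliers([], 0, 10) == []
    assert filter_outliers([5], 0, 10) == [5]
    assert filter_outliers([1, 2, 3], 0, 10) == [1, 2, 3]
    # Decision on its true side (inside the fences) and false side (outside).
    assert filter_outliers([5, 50], 0, 10) == [5]
    assert filter_outliers([-5], 0, 10) == []
    # Boundary values: the comparison is strict, so the fences themselves are dropped.
    assert filter_outliers([0, 10], 0, 10) == []
    return True

print(run_white_box_tests())
```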

7.2.2 Integration Testing

Integration testing ensures that software and subsystems work together as a whole. It tests the
interfaces of all the modules to make sure that the modules behave properly when integrated
together. In this case, that is the communication between the device and the Google Translator Service.

7.2.3 Functional Testing

Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:

• Valid Input: identified classes of valid input must be accepted.

• Invalid Input: identified classes of invalid input must be rejected.

• Functions: identified functions must be exercised.

• Output: identified classes of application outputs must be exercised.

• Systems/Procedures: interfacing systems or procedures must be invoked.
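The first two items can be illustrated with a small input check. The sketch below is an assumption-laden example and not part of the project code: validate_record and is_rejected are hypothetical helpers, the required fields are taken from the column names used in Section 6.3, and the sample records are made up.

```python
def validate_record(record):
    """Accept a purchase record only if the required fields are present and well-formed."""
    required = ("User_ID", "Product_ID", "Gender")
    missing = [name for name in required if name not in record]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    if record["Gender"] not in ("M", "F"):
        raise ValueError("unknown Gender value: " + repr(record["Gender"]))
    return True

def is_rejected(record):
    """Return True when validate_record refuses the record."""
    try:
        validate_record(record)
    except ValueError:
        return True
    return False

# Valid input must be accepted.
valid = {"User_ID": 1000001, "Product_ID": "P00069042", "Gender": "F"}
print(validate_record(valid))

# Invalid input must be rejected: a missing field and an unknown category value.
print(is_rejected({"User_ID": 1000001, "Gender": "F"}))
print(is_rejected({"User_ID": 1000001, "Product_ID": "P00069042", "Gender": "X"}))
```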

7.2.4 System Testing

System testing involves in-house testing of the entire system in an emulator before delivery to
the user. Its aim is to satisfy the user that the system meets all requirements of the client's
specifications.

7.2.5 Acceptance Testing

It is pre-delivery testing in which the entire system is tested on a real Android device with
real-world data and usage to find errors.

7.3 Test Strategy and Design

A test strategy is designed for all the different types of functionality and hardware by
determining the effort and cost needed to achieve the objectives of the system. For any project,
the test strategy can be prepared by:

• Defining the scope of the testing.
• Identifying the type of testing required.
• Identifying risks and issues.
• Creating test logistics.

7.4 Test Objectives

The test objective is the overall goal of the test execution. Objectives are defined in such a way
that the system is bug-free and ready for use by the end users. A test objective can be defined by
identifying the software features that need to be tested and the goal that each of these features
must achieve for the test to be counted as successful.

7.5 Features to be Tested

• Sales should be predicted.
• Prediction should be accurate.

CHAPTER 8

OUTPUTS

Checking null values

Filling the null values

Accuracy

Purchase prediction
CHAPTER 9

Conclusion and Future Scope

Conclusion

With traditional methods not being of much help to business growth in terms of revenue, the use
of machine learning approaches proves to be an important factor in shaping the business plan,
taking into consideration the shopping patterns of consumers.

Projection of sales with respect to several factors, including last year's sales, helps
businesses adopt suitable strategies for increasing the sales of goods that are in demand.

The dataset used for the experiments is the Black Friday Sales Dataset from Kaggle.
The models used are Random Forest Regressor and K-Nearest Neighbour. The evaluation
measure used is Mean Squared Error (MSE), reported in the code as its square root (RMSE)
alongside the R2 score. Based on Table II, Random Forest Regressor is the most suitable for
the prediction of sales on the given dataset.
Thus the proposed model will predict the customer purchase on Black Friday and give the
retailer insight into customer choice of products. This allows discounts to be based on
customer-centric choices, benefiting both the retailer and the customer.

Future Scope

• As future research, we can perform hyperparameter tuning and apply different machine
learning algorithms.

• In future we can make use of stronger gradient boosting algorithms such as LightGBM.
Black-box approaches such as artificial neural networks (ANN) can also be incorporated.
The prediction model can be further adapted for retailers and customers in different
countries to suit their needs.
