PREDICTIVE ANALYSIS FOR BIG MART SALES USING MACHINE LEARNING ALGORITHMS
ABSTRACT
Nowadays, shopping malls and big marts keep track of the sales data of each
individual item in order to predict future customer demand and update
inventory management accordingly. These data stores basically contain a large
volume of customer data and individual item attributes in a data warehouse.
Further, anomalies and frequent patterns are detected by mining the data
store in the data warehouse. The resulting data can be used for predicting
future sales volume with the help of different machine learning techniques
for retailers such as Big Mart. In this paper, we propose a predictive model
using the XGBoost Regressor technique for predicting the sales of a company
like Big Mart, and we find that the model produces better performance
compared to existing models.
CONTENT
Chapter   List of Contents
I     Abstract
II    List of Acronyms
III   List of Figures
1     Introduction
2     Literature Survey
3     System Requirements
      3.1 Hardware Requirements
      3.2 Software Requirements
      3.3 Language Specification
      3.4 History of Python
4     System Analysis
      4.1 Purpose
      4.2 Scope
      4.3 Existing System
      4.4 Proposed System
5     System Design
      5.1 Input Design
      5.2 Output Design
      5.3 Data Flow Diagram
6     Module Implementation
      6.1 Modules
      6.1.1 Data Collection
      6.1.2 Data Set
      6.1.3 Data Preparation
      6.1.4 Model Selection
      6.1.5 Analyze and Prediction
      6.1.6 Accuracy on Test Set
      6.1.7 Saving the Trained Model
7     System Implementation
8     System Testing
      8.1 Unit Testing
      8.2 Integration Testing
      8.3 Functional Testing
      8.4 System Test
      8.5 White Box Testing
      8.6 Black Box Testing
      8.7 Acceptance Testing
9     Results and Discussion
10    Conclusion
11    References
LIST OF FIGURES:
FIG 9.1 Login Page
FIG 9.3 Preprocessed Dataset of Big Mart Sales
CHAPTER 1
INTRODUCTION
Big Mart is a large supermarket chain, with stores all around the country,
and its current board has set out a challenge to data scientists to help them
create a model that can accurately predict sales, per product, for each
store. Big Mart has collected sales data from Kaggle for various products
across different stores in different cities. With this information the
corporation hopes to identify the products and stores that play a key role in
its sales and use that information to take appropriate business decisions.
An example of a decision tree can be explained using the binary tree above.
Say you want to predict whether a person is fit given information like their
age, eating habits, and physical activity. The decision nodes here are
questions like 'What is the age?', 'Does he exercise?', 'Does he eat a lot of
pizza?', and the leaves are outcomes like 'fit' or 'unfit'. In this case it is
a binary classification problem (a yes/no type problem).
Now that we know what a decision tree is, we will see how it works internally.
There are many algorithms that construct decision trees, but one of the best
known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3.
o In a decision tree, there are two types of nodes: the decision node and the
leaf node. Decision nodes are used to make decisions and have multiple
branches, whereas leaf nodes are the outputs of those decisions and do not
contain any further branches.
o The decisions or tests are performed on the basis of the features of the
given dataset.
o The logic behind the decision tree can be easily understood because it
shows a tree-like structure.
Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
In a decision tree, to predict the class of a given record, the algorithm
starts from the root node of the tree. It compares the value of the root
attribute with the corresponding attribute of the record (from the real
dataset) and, based on the comparison, follows the branch and jumps to the
next node. For the next node, the algorithm again compares the attribute
value with the other sub-nodes and moves further. It continues the process
until it reaches a leaf node of the tree. The complete process can be better
understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where you cannot classify the nodes any further; call the final node a leaf
node.
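A minimal sketch of these steps in Python, assuming a small in-memory dataset of dictionaries with a categorical target column named "label" (the data, attribute names and helper functions below are illustrative only, not the report's implementation):

# Minimal ID3-style sketch: recursively pick the attribute with the highest
# information gain and split until the subset is pure or no attributes remain.
import math
from collections import Counter

def entropy(rows, target="label"):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target="label"):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target="label"):
    labels = {r[target] for r in rows}
    if len(labels) == 1 or not attrs:                    # Step-5 stop condition: leaf node
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))  # Step-2/Step-4: best attribute (ASM)
    node = {best: {}}
    for value in {r[best] for r in rows}:                # Step-3: split into subsets
        subset = [r for r in rows if r[best] == value]
        node[best][value] = build_tree(subset, [a for a in attrs if a != best], target)
    return node

data = [
    {"exercises": "yes", "eats_pizza": "no",  "label": "fit"},
    {"exercises": "no",  "eats_pizza": "yes", "label": "unfit"},
    {"exercises": "yes", "eats_pizza": "yes", "label": "fit"},
    {"exercises": "no",  "eats_pizza": "no",  "label": "unfit"},
]
print(build_tree(data, ["exercises", "eats_pizza"]))     # e.g. {'exercises': {'yes': 'fit', 'no': 'unfit'}}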
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. To solve this problem, the decision
tree starts with the root node (the Salary attribute, chosen by ASM). The root
node splits further into the next decision node (distance from the office) and
one leaf node based on the corresponding labels. The next decision node
further splits into one decision node (cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and
Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to
select the best attribute for the root node and for the sub-nodes. To solve
such problems there is a technique called the Attribute Selection Measure
(ASM). With this measure, we can easily select the best attribute for the
nodes of the tree. There are two popular ASM techniques:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after a
dataset is segmented on the basis of an attribute.
o According to the value of information gain, we split the node and build the
decision tree.
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
where Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no), S is the total
number of samples, and P(yes) and P(no) are the probabilities of the
respective classes.
2. Gini Index:
o The Gini index is a measure of impurity or purity used while creating a
decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high
Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index
to create binary splits.
Gini Index = 1 - sum_j (P_j)^2
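As a concrete illustration of both measures, the sketch below computes entropy, information gain and the Gini index from simple class counts; the example counts are arbitrary and not taken from the Big Mart data:

# Entropy, information gain, and Gini index for a two-class split (sketch).
import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gini(pos, neg):
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

def information_gain(parent, children):
    # parent and children are (pos, neg) tuples; the children partition the parent.
    total = sum(parent)
    weighted = sum((sum(c) / total) * entropy(*c) for c in children)
    return entropy(*parent) - weighted

parent = (9, 5)                     # a node with 9 "yes" and 5 "no" samples
children = [(6, 2), (3, 3)]         # the two subsets produced by a candidate split
print(round(entropy(*parent), 3))              # impurity before the split (about 0.940)
print(round(information_gain(parent, children), 3))
print(round(gini(*parent), 3))                 # CART-style impurity (about 0.459)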
Pruning: Getting an Optimal Decision Tree
Pruning is the process of deleting unnecessary nodes from a tree in order to
obtain the optimal decision tree. A tree that is too large increases the risk
of overfitting, while a tree that is too small may not capture all the
important features of the dataset. A technique that decreases the size of the
learning tree without reducing accuracy is therefore known as pruning. There
are mainly two types of tree pruning technique used:
o Cost Complexity Pruning
o Reduced Error Pruning
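For illustration, the sketch below applies cost-complexity (post-)pruning through scikit-learn's ccp_alpha parameter on a synthetic dataset; the data and the simple way the best alpha is chosen are assumptions for the example, not the report's implementation:

# Larger ccp_alpha values prune more branches, trading a little training
# accuracy for a simpler tree that usually generalizes better.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_test, y_test)
    if score >= best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.4f}, test accuracy={best_score:.3f}")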
CHAPTER 2
LITERATURE SURVEY
1) A comparative study of linear and nonlinear models for aggregate retail
sales forecasting
The purpose of this paper is to compare the accuracy of various linear and
nonlinear models for forecasting aggregate retail sales. Because of the
strong seasonal fluctuations observed in the retail sales, several traditional
seasonal forecasting methods such as the time series approach and the
regression approach with seasonal dummy variables and trigonometric
functions are employed. The nonlinear versions of these methods are
implemented via neural networks that are generalized nonlinear functional
approximators. Issues of seasonal time series modeling such as
deseasonalization are also investigated. Using multiple cross-validation
samples, we find that the nonlinear models are able to outperform their
linear counterparts in out-of-sample forecasting, and prior seasonal
adjustment of the data can significantly improve forecasting performance of
the neural network model. The overall best model is the neural network built
on deseasonalized time series data. While seasonal dummy variables can be
useful in developing effective regression models for predicting retail sales,
the performance of dummy regression models may not be robust.
Furthermore, trigonometric models are not useful in aggregate retail sales
forecasting.
This work combines green supply chain management, green product deletion
decisions and green cradle-to-cradle performance evaluation with an Adaptive
Neuro-Fuzzy Inference System (ANFIS) to create a green system. Several
factors like the design process, client specification, computational
intelligence and soft computing are analysed, and emphasis is given to
solving problems of the real domain. In this paper, consumer electronics and
smart systems that produce nonlinear outputs are considered. ANFIS is used
for handling these nonlinear outputs and offers sustainable development and
management. This system offers decision making considering multiple
objectives and optimizing multiple outputs. The system also provides
efficient control performance and faster data transfer.
4) Forecasting Monthly Sales Retail Time Series: A Case Study
AUTHORS: Giuseppe Nunnari, Valeria Nunnari
This paper presents a case study concerning the forecasting of monthly retail
time series recorded by the US Census Bureau from 1992 to 2016. The modeling
problem is tackled in two steps. First, the original time series are
de-trended by using a moving-window averaging approach. Subsequently, the
residual time series are modeled by Non-linear Auto-Regressive (NAR) models,
using both Neuro-Fuzzy and Feed-Forward Neural Network approaches. The
goodness of the forecasting models is objectively assessed by calculating the
bias, MAE and RMSE errors. Finally, the model skill index is calculated
considering the traditional persistence model as reference. Results show that
there is an advantage in using the proposed approaches compared to the
reference one.
The multiple linear regression method was used to analyze the overlay accuracy
model and study the feasibility of using linear methods to solve parameters
of nonlinear overlay equations. The methods of analysis include changing
the number of sample points to derive the least sample number required for
solving the accurate estimated parameter values. Besides, different
high-order lens distortion parameters were ignored, and only the various modes of
low-order parameters were regressed to compare their effects on the overlay
analysis results. The findings indicate that given a sufficient number of
sample points, the usage of multiple linear regression analysis to solve the
high-order nonlinear overlay accuracy model containing seventh-order lens
distortion parameters is feasible. When the estimated values of low-order
overlay distortion parameters are far greater than those of high-order lens
distortion parameters, excellent overlay improvement can still be obtained
even if the high-order lens distortion parameters are ignored. When the
overlay at the four corners of image field obviously
exceeds that near the center of image field, it is found, through simulation,
that the seventh-order parameters overlay model established in this paper
has to be corrected by high-order lens distortion parameters to significantly
improve the overlay accuracy.
DATA MODULES
Preprocessing Module: Data preprocessing is a method for preparing and
adapting raw data to a learning model. This is the first and significant step
in constructing a machine learning model. Real-world data generally contains
noise and missing values and may be in an unusable format, especially for
machine learning models.
Evaluation Module:
Prediction module:
CHAPTER 3
SYSTEM REQUIREMENTS
3.1 HARDWARE REQUIREMENTS:
System : Pentium i3 Processor
Hard Disk : 500 GB
Monitor : 15" LED
Input Devices : Keyboard, Mouse
RAM : 4 GB
Python:
Python is a Beginner's Language − Python is a great language for
beginner-level programmers and supports the development of a wide range of
applications, from simple text processing to WWW browsers to games.
Python was developed by Guido van Rossum in the late eighties and
early nineties at the National Research Institute for Mathematics
and Computer Science in the Netherlands.
Python Features
Easy-to-maintain − Python's source code is fairly easy-to-maintain.
A broad standard library − The bulk of Python's library is very portable and
cross-platform compatible on UNIX, Windows, and Macintosh.
It provides very high-level dynamic data types and supports
dynamic type checking.
CHAPTER 4
SYSTEM ANALYSIS
4.1 PURPOSE
The purpose of this document is to describe the prediction of Big Mart sales
using machine learning algorithms. In detail, this document will provide a
general description of our project, including user requirements, product
perspective, an overview of requirements, and general constraints. In
addition, it will also provide the specific requirements and functionality
needed for this project.
4.2 SCOPE
The scope of this SRS document persists for the entire life cycle of the
project. This document defines the final state of the software requirements
agreed upon by the customers and designers. Finally, at the end of the
project execution, all the functionalities should be traceable from the SRS
to the product. The document describes the functionality, performance,
constraints, interface and reliability for the entire life cycle of the
project.
4.3 EXISTING SYSTEM:
Auto-Regressive Integrated Moving Average (ARIMA) and Auto-Regressive Moving
Average (ARMA) models have been utilized to develop several sales forecasting
standards. However, sales forecasting is a sophisticated problem that is
influenced by both external and internal factors, and there are two
significant drawbacks to the statistical technique as set out by
A. S. Weigend et al. A hybrid seasonal quantile regression approach and an
Auto-Regressive Integrated Moving Average (ARIMA) approach to daily food
sales forecasting were recommended by N. S. Arunraj, who also found that the
performance of the individual models was relatively lower than that of the
hybrid model.
E. Hadavandi used the integration of Genetic Fuzzy Systems (GFS) and data
clustering to forecast the sales of printed circuit boards. In their paper,
K-means clustering produced K clusters of all data records. Then, all
clusters were fed into independent models with database tuning and rule-based
extraction capability.
Recognized work in the field of sales forecasting was done by P. A. Castillo:
sales forecasting of newly published books was done in a publishing market
management setting using computational techniques. Artificial neural networks
are also used in the area of revenue forecasting. Fuzzy neural networks have
been developed with the goal of improving predictive efficiency, and the
Radial Basis Function Neural Network (RBFN) is expected to have great
potential for forecasting sales.
Complex models like neural networks can be overkill for simple regression
problems.
The existing system's prediction analysis gives lower accuracy.
Forecasting methods and applications suffer from lack of data and short
product life cycles, and consumer-oriented markets face uncertain demands, so
historical data alone may not be enough for accurate prediction.
4.4 PROPOSED SYSTEM:
The objective of this proposed system is to predict future sales from the
given data of previous years using Decision Tree Regression.
Another objective is to conclude which model is more efficient and gives fast
and accurate results by using Decision Tree Regression.
To find out the key factors that can increase sales and what changes could be
made to the product or store's characteristics.
Experts have also shown that a smart sales forecasting program is required to
manage vast volumes of data for business organizations.
We are predicting the accuracy of Decision Tree Regression. Our predictions
help big marts to refine their methodologies and strategies, which in turn
helps them increase their profit. The predicted results will be very useful
for the executives of the company to know about their sales and profits. This
will also give them ideas for new locations or centres of Big Mart.
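A minimal sketch of this proposed approach, assuming the features and the sales target have already been prepared (the random placeholder data below merely stands in for the preprocessed Big Mart dataset and its Item_Outlet_Sales column):

# Fit a Decision Tree Regressor on historical data and measure its accuracy
# on a held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                            # placeholder feature matrix
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=500)    # placeholder sales values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("MAE:", round(mean_absolute_error(y_test, pred), 3))
print("R^2:", round(r2_score(y_test, pred), 3))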
CHAPTER 5
SYSTEM DESIGN
A quality output is one which meets the requirements of the end user and
presents the information clearly. In any system, the results of processing
are communicated to the users and to other systems through outputs. In output
design it is determined how the information is to be displayed for immediate
need, as well as the hard-copy output. It is the most important and direct
source of information to the user. Efficient and intelligent output design
improves the system's relationship with the user and helps user
decision-making.
The output form of an information system should accomplish one or more of the
following objectives:
o Convey information about past activities, current status or projections of
the future.
o Signal important events, opportunities, problems, or warnings.
o Trigger an action.
o Confirm an action.
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized
general-purpose modeling language in the field of object-oriented software
engineering. The standard is managed, and was created by, the Object
Management Group.
The goal is for UML to become a common language for creating models of
object-oriented computer software. In its current form UML comprises two
major components: a meta-model and a notation. In the future, some form of
method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying,
visualizing, constructing and documenting the artifacts of software systems,
as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that
have proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and
the software development process. The UML uses mostly graphical notations to
express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling Language so
that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core
concepts.
3. Be independent of particular programming languages and development
process.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of OO tools market.
6. Support higher level development concepts such as collaborations,
frameworks, patterns and components.
7. Integrate best practices.
USE CASE DIAGRAM
A use case diagram presents a graphical overview of the functionality
provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a
use case diagram is to show which system functions are performed for which
actor. The roles of the actors in the system can be depicted.
SEQUENCE DIAGRAM
A sequence diagram is a kind of interaction diagram that shows how processes
operate with one another and in what order. Sequence diagrams are sometimes
called event diagrams or event scenario diagrams.
CHAPTER 6
MODULE IMPLEMENTATION
6.1 MODULES:
Data Collection
Dataset
Data Preparation
Model Selection
Analyze and Prediction
Accuracy on test set
Saving the Trained Model
MODULES DESCRIPTION:
6.1.1 Data Collection:
This is the first real step towards the development of a machine learning
model: collecting data. This is a critical step that determines how good the
model will be; the more and better data we get, the better our model will
perform. There are several techniques to collect the data, like web scraping,
manual intervention, etc.
6.1.2 Dataset:
The dataset consists of 8523 individual records. There are 12 columns in the
dataset: Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility,
Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year,
Outlet_Size, Outlet_Location_Type, Outlet_Type, and Item_Outlet_Sales (the
sales value to be predicted).
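A small sketch of loading and inspecting this dataset with pandas; the file name Train.csv follows the usual Kaggle download referenced in [23] and should be adjusted to the local path:

# Load the Big Mart sales training data and inspect its shape and columns.
import pandas as pd

data = pd.read_csv("Train.csv")

print(data.shape)            # expected: (8523, 12)
print(data.columns.tolist())
print(data.isnull().sum())   # missing values per column (e.g. Item_Weight, Outlet_Size)
print(data.describe())       # summary statistics of the numeric columns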
6.1.3 Data Preparation:
Wrangle the data and prepare it for training. Clean whatever requires it
(remove duplicates, correct errors, deal with missing values, normalization,
data type conversions, etc.). Randomize the data, which erases the effects of
the particular order in which we collected and/or otherwise prepared our data.
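A small sketch of these preparation steps on the Big Mart data; the imputation choices and the decision to drop Item_Identifier before encoding are illustrative assumptions:

# Handle missing values, one-hot encode categorical columns, then shuffle/split.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("Train.csv").drop_duplicates()

# Fill missing values: the numeric column with its mean, the categorical one with its mode.
data["Item_Weight"] = data["Item_Weight"].fillna(data["Item_Weight"].mean())
data["Outlet_Size"] = data["Outlet_Size"].fillna(data["Outlet_Size"].mode()[0])

# One-hot encode the remaining categorical attributes so the regressor gets numeric input.
target = data["Item_Outlet_Sales"]
features = pd.get_dummies(data.drop(columns=["Item_Outlet_Sales", "Item_Identifier"]))

# Randomize and split; shuffling erases any ordering effects present in the file.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, shuffle=True, random_state=42)
print(X_train.shape, X_test.shape)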
6.1.4 Model Selection:
[Figure: decision tree structure showing conditions (decision nodes) leading to leaf nodes]
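As an illustration of how model selection could be carried out, the sketch below compares a linear regression baseline against Decision Tree Regression with k-fold cross-validation; the random data is only a placeholder for the prepared Big Mart features:

# Compare candidate regressors by cross-validated mean absolute error.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))                                  # placeholder features
y = 2 * X[:, 0] + np.abs(X[:, 1]) + rng.normal(size=400)       # placeholder target

for name, model in [("Linear Regression", LinearRegression()),
                    ("Decision Tree Regression", DecisionTreeRegressor(max_depth=6, random_state=0))]:
    mae = -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5).mean()
    print(f"{name}: cross-validated MAE = {mae:.3f}")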
6.1.5 Analyze and Prediction:
Once you are confident enough to take your trained and tested model into the
production-ready environment, the first step is to save it into a .h5 or .pkl
file using a library like pickle. Make sure you have pickle installed in your
environment. Next, import the module and dump the model into a .pkl file.
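A minimal sketch of saving and reloading a trained model with pickle; the tiny stand-in model and the file name bigmart_model.pkl are placeholders for the actual trained regressor:

# Dump the fitted model to a .pkl file and load it back for later predictions.
import pickle
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor().fit([[0], [1], [2]], [0, 1, 2])   # tiny stand-in model

with open("bigmart_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("bigmart_model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict([[1.5]]))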
CHAPTER 7
SYSTEM IMPLEMENTATION:
The system architectural design is the design process for identifying the
subsystems making up the system and the framework for subsystem control and
communication. The goal of the architectural design is to establish the
overall structure of the software system.
[Architecture: Big Mart Sales Dataset → Pre-processing and Feature Selection → Decision Tree Regression → Predicting the Sales based on the given features → Performance Analysis and Graph]
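One way these stages could be wired together is a single scikit-learn Pipeline, sketched below; the column names follow the Kaggle Big Mart dataset and the preprocessing choices are assumptions for illustration:

# Pre-processing and feature handling feeding a Decision Tree Regressor.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

numeric = ["Item_Weight", "Item_Visibility", "Item_MRP"]
categorical = ["Item_Fat_Content", "Item_Type", "Outlet_Size", "Outlet_Type"]

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", DecisionTreeRegressor(max_depth=8, random_state=0)),
])

# Usage (assuming the Kaggle training file is available locally):
# data = pd.read_csv("Train.csv")
# pipeline.fit(data[numeric + categorical], data["Item_Outlet_Sales"])
# predictions = pipeline.predict(data[numeric + categorical])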
CHAPTER-8
SYSTEM TESTING
System testing ensures that the software system meets its requirements and
user expectations and does not fail in an unacceptable manner. There are
various types of tests, and each test type addresses a specific testing
requirement.
TYPES OF TESTS
8.1 Unit Testing
Unit testing involves the design of test cases that validate that the
internal program logic is functioning properly and that program inputs
produce valid outputs. Each test case contains clearly defined inputs and
expected results.
8.2 Integration Testing
Integration tests are designed to test integrated software components to
determine whether they actually run as one program.
8.3 Functional Testing
Functional testing verifies that valid input is accepted, invalid input is
rejected, identified functions are exercised, and interfacing systems or
procedures are invoked.
Organization and preparation of functional tests is focused on requirements,
key functions, or special test cases. In addition, systematic coverage
pertaining to identified business process flows, data fields, predefined
processes, and successive processes must be considered for testing. Before
functional testing is complete, additional tests are identified and the
effective value of current tests is determined.
Unit Testing:
Unit testing is usually conducted as part of a combined code and unit
test phase of the software lifecycle, although it is not uncommon for coding
and unit testing to be conducted as two distinct phases.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
Integration Testing
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
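For illustration, a unit test for the data-preparation step might look like the following; the helper fill_missing_weights and the sample values are hypothetical, defined inline so the test is self-contained:

# Unit tests (unittest) for a hypothetical preprocessing helper that fills
# missing Item_Weight values with the column mean.
import unittest
import pandas as pd

def fill_missing_weights(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["Item_Weight"] = out["Item_Weight"].fillna(out["Item_Weight"].mean())
    return out

class TestPreprocessing(unittest.TestCase):
    def test_no_missing_values_after_fill(self):
        df = pd.DataFrame({"Item_Weight": [9.3, None, 17.5]})
        self.assertEqual(fill_missing_weights(df)["Item_Weight"].isnull().sum(), 0)

    def test_existing_values_unchanged(self):
        df = pd.DataFrame({"Item_Weight": [9.3, None, 17.5]})
        self.assertAlmostEqual(fill_missing_weights(df)["Item_Weight"][0], 9.3)

if __name__ == "__main__":
    unittest.main()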
CHAPTER 9
RESULTS AND DISCUSSION
FIG 9.1 LOGIN PAGE
FIG 9.3 PREPROCESSED DATASET OF BIG MART SALES
CHAPTER 10
CONCLUSION
In this work, we studied the effectiveness of Decision Tree Regression on
past revenue data and reviewed which algorithm performs best. We propose
software that uses a regression approach for predicting sales based on sales
data from the past; the accuracy of linear regression prediction can be
enhanced with this method, and the performance of Decision Tree Regression
can be determined. We can therefore conclude that Decision Tree Regression
gives the better prediction with respect to accuracy.
FUTURE WORK:
In future, the forecasting sales and building a sales plan can help to avoid
unforeseen cash flow and manage production, staff and financing needs
more effectively. In future work we can also consider with the ARIMA model
CHAPTER-11
REFERENCES
[1] Ching Wu Chu and Guoqiang Peter Zhang, "A comparative study of linear and
nonlinear models for aggregate retail sales forecasting", International
Journal of Production Economics, vol. 86, pp. 217-231, 2003.
[3] Suma, V. and Shavige Malleshwara Hills, "Data Mining based Prediction of
Demand in Indian Market for Refurbished Electronics", Journal of Soft
Computing Paradigm (JSCP), vol. 2, no. 2, pp. 101-110, 2020.
[4] Giuseppe Nunnari, Valeria Nunnari, "Forecasting Monthly Sales
Retail Time Series: A Case Study”, Proc. of IEEE Conf. on Business
Informatics (CBI), July 2017.
[5] https://halobi.com/blog/sales-forecasting-five-uses/.
[10] Xinqing Shu, Pan Wang, “An Improved Adaboost
Algorithm based on Uncertain Functions”, Proc. of Int. Conf. on
Industrial Informatics – Computing Technology, Intelligent
Technology, Industrial Information Integration, Dec. 2015.
[18] Pei Chann Chang and Yen-Wen Wang, “Fuzzy Delphi and back
propagation model for sales forecasting in PCB industry”, Expert
Systems with Applications, vol. 30, pp. 715-726, 2006.
[23] https://www.kaggle.com/brijbhushannanda1979/bigmartsalesdata.