

Palak Mittal, et. al.

International Journal of Engineering Research and Applications


www.ijera.com
ISSN: 2248-9622, Vol. 10, Issue 5, (Series-III) May 2020, pp. 50-54

RESEARCH ARTICLE OPEN ACCESS

Sales Analysis and Prediction Using Python


Palak Mittal*, Sujay**, Simran***, Krishan Kumar****, Pronika Chawla*****
*(Department of CSE, MRIIRS, Faridabad)
**(Department of CSE, MRIIRS, Faridabad)
***(Department of CSE, MRIIRS, Faridabad)
****(Department of CSE, MRIIRS, Faridabad)
*****(Department of CSE, MRIIRS, Faridabad)

ABSTRACT
These days, shopping centers and Big Marts keep records of the sales of every item in order to forecast
potential customer demand and to manage inventory. In a data center, these data warehouses
hold a vast amount of customer data and individual item attributes. Anomalies and
recurring patterns are identified by mining the data in the data warehouse. The results can be used
to forecast future sales for retailers like Big Mart using various machine learning techniques. In
this paper, we build predictive models using machine learning algorithms to predict the sales of a company
and compare the models to find out which one performs better.
Keywords: Data Analytics, Machine Learning, Linear Regression, Random Forest, Python
----------------------------------------------------------------------------------------------------------------------------- ----------
Date of Submission: 13-05-2020 Date of Acceptance: 26-05-2020
----------------------------------------------------------------------------------------------------------------------------- ----------

I. INTRODUCTION

As the internet grows rapidly, we have moved from standard data such as text and documents to more diverse types of data, including large volumes of high-quality audio, images, photographs, interactive charts, location data and much more. The volume of data grows every second. Big data is of no use unless it is used for making decisions.[1]

Today, data analytics is used across various fields for making predictions. One application is in the government sector, where the analysis of big data has proved very important. Big data analysis proved instrumental in Barack Obama's successful 2012 re-election campaign. For the BJP and its allies, big data processing played a major role in securing a win in a highly competitive election. The government of India uses various methods to assess how the population is reacting to political intervention as well as to policy proposals.

Another area where data analytics is used is social media analytics. The rise of social networking has caused a large data explosion. Organizations such as IBM have developed numerous tools to evaluate social network behavior. One such tool is Cognos Consumer Insights, an application that runs on the BigInsights Big Data Platform. These tools can be used to better understand the mood of people about a certain activity going on in their region and in the world.[2]

Data can be of various types: structured, semi-structured or unstructured. Structured data is organized in rows and columns, essentially as tables in a database. It requires minimal processing, is the easiest to analyze, and can be fed directly to a model for finding patterns, learning from the data, and then analyzing and showing trends. Semi-structured data has no rigid tabular schema but carries tags or metadata (data about data) that describe its elements, as in JSON or XML documents.

Unstructured data has no specific format and is difficult to analyze. It requires a lot of pre-processing before it can be used for analysis. It is a very complex form of data and comes from non-traditional sources, in forms such as audio, video, graphs, plots, PowerPoint presentations, instant messages, and collaboration software.
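The difference between structured and semi-structured data can be illustrated with a minimal sketch using only the Python standard library. The sample rows, keys and values below are hypothetical and chosen only for illustration. Structured rows share fixed columns and can be aggregated directly, whereas semi-structured records have to be flattened to a common set of columns first.

```python
import csv
import io
import json

# Structured data: fixed columns, one record per row (hypothetical sample rows).
csv_text = "Item_Identifier,Item_MRP\nA1,249.8\nB2,48.3\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
mean_mrp = sum(float(r["Item_MRP"]) for r in rows) / len(rows)

# Semi-structured data: self-describing keys, but records need not share a schema.
json_text = '[{"id": "A1", "tags": ["dairy"]}, {"id": "B2", "review": {"stars": 4}}]'
records = json.loads(json_text)

# Flatten to a common set of columns, filling absent keys with None.
columns = sorted({key for rec in records for key in rec})
table = [[rec.get(col) for col in columns] for rec in records]

print(round(mean_mrp, 2))
print(columns)
print(table)
```

Unstructured data such as audio or video has no comparable keys at all, which is why it needs far more pre-processing before analysis.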

www.ijera.com DOI: 10.9790/9622-1005035054 50 | P a g e



Fig 1: structured and unstructured data

In the data era, large quantities of data have become available to decision makers. Big data refers to datasets that are not only big but also high in variety and velocity, which makes them difficult to handle using traditional tools and techniques. Due to the rapid growth of such data, solutions need to be studied and provided in order to handle these datasets and extract value and knowledge from them. Furthermore, decision makers need to be able to gain valuable insights from such varied and rapidly changing data. Such value can be provided using big data analytics, which is the application of advanced analytics techniques to big data. There are a number of tools that can be used for storing and analyzing data. Some of the popular tools for storing data are as follows:
• Apache Hadoop: It can store enormous amounts of data in a cluster. It is a Java-based framework that runs in parallel on a cluster and allows users to process data across all nodes. It replicates data, resulting in high availability of data.
• Hive: It is a distributed data management tool for Hadoop. It can be used for data mining because it supports SQL-like queries (HiveQL) for accessing big data.
• Apache Cassandra: It is a scalable, high-performance distributed NoSQL database designed to handle large amounts of data. A NoSQL database lets us store and retrieve data that is not modeled as tabular relations. The qualities of this database are that it is schema-free, has a simple API, offers tunable consistency, supports easy replication, and can handle large amounts of data. [1]

Some of the popular tools for analyzing data are as follows:
• RapidMiner: RapidMiner can connect to many types of data source, including Microsoft SQL, Sybase, IBM SPSS, Excel, Oracle, MySQL, Access, Teradata, IBM DB2, Ingres, and dBase. The tool is very effective and can generate analyses based on real-life data transformation settings.
• Tableau Public: It is an intuitive and simple tool that offers interesting insights through data visualization. One can test a hypothesis, explore the data, and cross-check the resulting insights.
• Jupyter Notebook: It is an accessible tool for performing end-to-end data science workflows: data cleaning, statistical modeling, building and training machine learning models, and visualizing data. [3]

Sales is one of the many fields where data analytics can be used for making predictions and thereby gaining insights for decision making. We have used big data analytics to analyze and predict the sales of a product using two models, linear regression and random forest. We compare the two models to understand which of them performs better. The language used for implementation is Python, and the platform used is Jupyter Notebook.

II. DATA SET
A collection of data is termed a dataset. For tabular data, a dataset corresponds to one or more database tables. Each row of the table represents one record of the dataset, whereas each column represents a particular variable. The dataset lists the values of each variable for every member of the dataset. Each individual value is termed a datum. A dataset may also consist of a large number of files and documents.
Many characteristics define a dataset, such as the attributes and variables it contains, their number and types, and the statistical measures applied to it. A number of popular built-in datasets are available in the Python libraries used for analysis. A few examples of such built-in datasets are:
• Iris flower dataset: It is a multivariate dataset that was introduced by Ronald Fisher in 1936.
• MNIST database: It consists of images of handwritten digits and is used for image classification, clustering, and image processing. [4]
The dataset that we have used is the sales dataset acquired from Kaggle. It contains two CSV files, namely train and test. The aim is to predict the sales of a product for the test dataset.
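The row-and-column view of a dataset and the train/test split can be sketched as follows. This is a minimal illustration using the standard csv module. The two in-memory CSV strings are hypothetical miniature stand-ins for the train and test files, and the helper name load is an assumption, not part of the original work.

```python
import csv
import io

# Hypothetical miniature stand-ins for the train and test CSV files.
TRAIN_CSV = """Item_Identifier,Item_MRP,Outlet_Type,Item_Outlet_Sales
X001,250.0,Supermarket,3700.0
X002,48.0,Small store,440.0
X003,140.0,Supermarket,2100.0
"""
TEST_CSV = """Item_Identifier,Item_MRP,Outlet_Type
X004,108.0,Supermarket
"""

def load(text):
    """Return (column names, list of row dicts) for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames, rows

train_cols, train_rows = load(TRAIN_CSV)
test_cols, test_rows = load(TEST_CSV)

# Each row is one record; each column is one variable.
print(len(train_rows), "records x", len(train_cols), "variables")

# The column present only in the training file is the value to predict.
target = [c for c in train_cols if c not in test_cols]
print("target:", target)
```

With real files, the same helper could be fed the contents of a file opened with open(path, newline=""), where path is wherever the downloaded CSV is stored.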


The dataset consists of 11 predictor fields, namely Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, and Outlet_Type, together with the outcome field Item_Outlet_Sales. The fields are described as follows:
• Item_Identifier: This field contains the unique product ID of the item. It is an ID variable.
• Item_Weight: This field contains the weight of the product. It is not considered in the hypothesis.
• Item_Fat_Content: This field tells whether the product is low fat or not. Low-fat items are generally preferred over other items. This field is linked to the ‘Utility’ hypothesis.
• Item_Visibility: This field gives the percentage of the total display area of all products that is allocated to the particular product. It is used for the ‘display area’ hypothesis.
• Item_Type: This field gives the category of the product. It can be used to derive more knowledge about utility.
• Item_MRP: This field gives the MRP (maximum retail price) of the product. It is not treated as important for the analysis and hence is not considered for the hypothesis.
• Outlet_Identifier: This field contains the unique store ID. It is an ID variable.
• Outlet_Establishment_Year: This field gives the year in which the store was established. It is not considered in the hypothesis.
• Outlet_Size: This field gives the ground area that the store covers. It is linked to the ‘store capacity’ hypothesis.
• Outlet_Location_Type: This field gives the type of city in which the store is located. It is linked to the ‘city type’ hypothesis.
• Outlet_Type: This field tells whether the store is a supermarket or a small store. It is also connected to the ‘store capacity’ hypothesis.
• Item_Outlet_Sales: This field gives the sales of the product in a store. It is the outcome variable that is being predicted.[5]

III. PYTHON FOR DATA ANALYTICS
Python is a high-level, interpreted programming language with easy syntax and semantics. It takes less effort to create applications in this language, which makes the job easier to perform. Many other programming languages are harder to learn than Python. Python has emerged as one of programmers' favorite languages and is widely used both for developing applications and for performing data analytics.

3.1 Features of Python
Python can achieve better productivity with less code. However, it is not as fast as some other programming languages. The features of this language are:
• High-level: Its constructs are close to the natural language that people use for communication, so it is easy to understand what task the code is performing.
• Interpreted: The code is executed by an interpreter at run time rather than compiled ahead of time to machine code, which makes debugging easy and efficient. This also makes Python slower than many compiled languages.
• Easy syntax: Indentation is used instead of braces to determine which code block belongs to a class or function. This makes the code easy to read.
• Dynamic semantics: There is no need to declare the type of a variable before using it. The type is determined automatically at run time from the assigned value.
• Portable: There is no need to change the code to run it on different systems. This makes it easy to work on a task.
• Open source: It is free and can be used and modified by anyone as per their preference.
• Object-oriented: It helps model real-world scenarios and supports encapsulation, which helps in building a well-structured application.
• Simplicity: Because block structure is expressed through indentation alone, applications can be written in fewer lines of code.
• Embedding properties: It is powerful and versatile and can be integrated with code written in other languages such as C.
• Library support: It supports various libraries that make obtaining solutions easy and fast.

3.2 Usage of Python
• Developing web applications with frameworks like Django and Flask.
• Creating workflows for software.
• Modifying files and data in databases.
• Performing complex scientific and analytical calculations.

3.3 History of Python
The Python programming language was developed approximately 30 years ago, around 1990, by Guido van Rossum and was first released in 1991. The main aspects of this programming language are its code readability and its use of significant whitespace. It is a multi-paradigm language and supports the functional, imperative, object-oriented, structured, and reflective paradigms.
There are about 8 different implementations of the Python programming language, namely CPython, PyPy, Stackless Python, MicroPython, CircuitPython, IronPython, Jython, and RustPython. Python is influenced by a number of other languages, namely ABC, Ada, ALGOL 68, APL, C, C++, CLU, Dylan, Haskell, Icon, Java, Lisp, Modula-3, Perl, and Standard ML. Languages whose development is influenced by Python include Apache Groovy, Boo, Cobra, CoffeeScript, D, F#, Genie, Go, JavaScript, Julia, Nim, Ring, Ruby, and Swift.

3.4 Scope of Python
Python has a number of applications, which are as follows:
• Web and Internet development: Python has a vast collection of libraries and packages for internet protocols that make the task of developing web applications easier. A few of the areas covered by its libraries are IMAP, FTP, and image processing. A few of the available packages are Feedparser, Beautiful Soup, and Requests. Frameworks such as Django and Flask are also available.
• Desktop GUI: One can build a user interface with Tk, a standard GUI library that ships with the binary distributions of Python.
• Scientific and numeric applications: Python is a powerful programming language, and scientific and numeric computing is one of its most popular applications. A number of libraries support these tasks, such as NumPy, Pandas, and SciPy.
• Software development: Python can be used by software developers as a support language for build control, management, and testing. A few examples are SCons, Buildbot, and Apache Gump.

IV. PROPOSED SYSTEM
The method to solve the problem at hand is given below. The unprocessed data at the Big Mart is collected. This raw data has to be pre-processed to handle missing data, outliers, and anomalies. We then train two different machine learning algorithms, namely linear regression and random forest, on the collected data to obtain a model. This model helps us to predict the final outcome.
ETL stands for Extract, Transform and Load. An ETL tool combines all three of these functions. It is fed data from a particular database and transforms the input data into a suitable format. The raw data is transformed into an understandable format using the data mining technique of data preprocessing. Data preprocessing is a very important step, as data collected from real sources may be incomplete or inconsistent.

Fig 2: block diagram of the system

4.1 Linear Regression
It finds the relationship between the dependent variable (Y) and one or more independent variables (X) using a straight line, the best-fit line, which is also termed the regression line. The equation representing this line is:
Y = a + b*X + e
In the above equation:
a is the intercept,
b is the slope of the line,
e is the error term.
The accuracy of the predictions can be measured for this model. Although this model is very popular for analysis, its disadvantage is that it can give less accurate results.[6]

Fig 3: linear regression
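The pre-processing step of Section IV and the regression line Y = a + b*X + e of Section 4.1 can be sketched together in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the data pairs are hypothetical, records with missing values are simply dropped (imputation would be an alternative), and the helper name fit_line is introduced here only for the example.

```python
import math

# Hypothetical (predictor, sales) pairs; None marks a missing value.
raw = [(50.0, 760.0), (100.0, None), (None, 2250.0), (150.0, 2240.0),
       (200.0, 3010.0), (250.0, 3740.0)]

# Pre-processing: drop records with a missing predictor or target.
clean = [(x, y) for x, y in raw if x is not None and y is not None]
xs = [x for x, _ in clean]
ys = [y for _, y in clean]

def fit_line(xs, ys):
    """Ordinary least squares for Y = a + b*X; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope of the regression line
    a = mean_y - b * mean_x    # intercept
    return a, b

a, b = fit_line(xs, ys)

# Residuals e = Y - (a + b*X); the RMSE summarises how well the line fits.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
rmse = math.sqrt(sum(e * e for e in residuals) / len(residuals))
print(f"a={a:.2f}, b={b:.2f}, rmse={rmse:.2f}")
```

An error measure such as the RMSE gives a single number per model, which is one way the two models could be compared on the same held-out data.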


4.2 Random Forest

Random forests are also known as random decision forests. Random forest is a machine learning algorithm that can be used for tasks such as classification and regression, among others. It builds multiple decision trees during training and outputs the mode of the classes predicted by the individual trees (for classification) or the mean of their predictions (for regression). It is used to overcome the main disadvantage of decision trees, which is overfitting.[6]

Fig 4: Random Forest

V. CONCLUSION
We propose a software tool for predicting future sales based on historical data. With this tool, one can determine how precise the predictions of the linear regression and random forest machine learning algorithms are.

ACKNOWLEDGEMENT
The successful realization of the project is the outcome of a consolidated effort of people from different fronts. We are thankful to Dr. Krishan Kumar for his valuable advice and support, without which we would not have been able to complete the project successfully.
We are thankful to Ms. Pronika Chawla for her guidance and support.
Words cannot express our gratitude to all those who helped us directly or indirectly in our endeavour. We take this opportunity to express our sincere thanks to everyone for their valuable suggestions, and to our family and friends for their support.

REFERENCES
[1]. Palak Mittal, Mansi Sharma, Dr. Prateek Jain, A Detailed Study of Security and Privacy Concerns in Big Data, International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 13, Number 10 (2018), pp. 7406-7411
[2]. https://www.digitalvidya.com/blog/big-data-applications/
[3]. https://www.analyticsvidhya.com/blog/2018/05/starters-guide-jupyter-notebook/
[4]. https://en.wikipedia.org/wiki/Data_set
[5]. https://medium.com/@nr3702/bigmart-sales-data-regression-using-python-57a5155767d7
[6]. Heramb Kadam, Rahul Shevade, Prof. Deven Ketkar, Mr. Sufiyan Rajguru, A Forecast for Big Mart Sales Based on Random Forests and Multiple Linear Regression, BE IT, FAMT, Ratnagiri, Assistant Professor, IT department, FAMT, IJEDR 2018 | Volume 6, Issue 4 | ISSN: 2321-9939
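The idea behind Section 4.2, training many trees on resampled data and combining their outputs, can be sketched in plain Python. This is a simplified stand-in and not a full random forest: each tree is a depth-1 regression tree (a stump) on a single feature, and the random feature selection of a real random forest is omitted. The data and all function names are hypothetical and introduced only for this illustration.

```python
import random
from statistics import mean, mode

def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the threshold with the lowest squared error."""
    best_sse, best = float("inf"), (None, mean(ys), mean(ys))
    for t in sorted(set(xs))[:-1]:            # keep both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best_sse:
            best_sse, best = sse, (t, ml, mr)
    return best                               # (threshold, left value, right value)

def predict_stump(stump, x):
    t, left_value, right_value = stump
    return left_value if t is None or x <= t else right_value

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Regression: average the predictions of the individual trees."""
    return mean(predict_stump(s, x) for s in forest)

# Hypothetical data with two sales levels.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0]
forest = fit_forest(xs, ys)
low, high = predict_forest(forest, 1), predict_forest(forest, 13)
print(low, high)

# Classification would instead take the mode of the per-tree class votes.
majority = mode(["high", "low", "high"])
print(majority)
```

Averaging over trees trained on different bootstrap samples smooths out the quirks of any single tree, which is the mechanism by which the ensemble reduces overfitting.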
