Intern Report
CHAPTER 1
INTRODUCTION
Competition among shopping malls and big marts grows more intense by the day, driven by the
rapid growth of global malls and online shopping. Every mall or mart tries to attract more
customers with personalized, short-time offers that depend on the day, so the sales volume of
each item needs to be predicted for the organization's inventory management, logistics,
transport service, etc. Present machine learning algorithms are very sophisticated and provide
techniques to predict or forecast an organization's future sales demand, helped by the cheap
availability of computing and storage systems.
Big Mart is a grocery supermarket brand. It started its journey by offering free home delivery
of food and groceries. Big Mart lets you walk away from the drudgery of grocery shopping and
offers a clean, comfortable way of browsing and shopping for groceries. Customers can discover
new products and shop for all their food and grocery needs from the comfort of their home or
office. No more getting stuck in traffic jams, paying for parking, standing in long queues and
carrying heavy bags: everything you want arrives when you want it, right at the doorstep.
In this report, we address the problem of predicting the sales of an item, based on customers'
future demand, in different Big Mart stores across various locations and products, using
previous records. Different machine learning algorithms such as linear regression and random
forest are used to predict sales volume. Since good sales are the life of every organization,
sales forecasting plays an important role in any shopping complex. A better prediction always
helps to develop and enhance business strategies for the marketplace, and it also improves
knowledge of the marketplace.
A standard sales prediction study helps to analyze in depth the situations and conditions that
occurred previously. The resulting inferences about customer acquisition, funds inadequacy and
strengths can then be applied before setting a budget and marketing plans for the upcoming
year. In other words, sales prediction is based on the resources available from the past.
In-depth knowledge of the past is required to improve the prospects in the marketplace under
any circumstances, especially external ones, and it allows the business to prepare for upcoming
needs. The basic technique for predicting sales is the statistical, or traditional, method.
However, such methods take much more time to produce a prediction and cannot handle non-linear
data, so machine learning techniques are deployed to overcome these problems. Machine learning
techniques can handle not only non-linear data but also huge data sets efficiently. To measure
the performance of the models we can use an accuracy measure and decide accordingly which
model predicts better.
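The report does not name the accuracy measure here. As a minimal sketch, assuming root-mean-square
error (RMSE) as the measure and using made-up sales figures for two hypothetical models, the
comparison could look like this:

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up sales figures and predictions from two hypothetical models.
actual = [100.0, 200.0, 300.0]
model_a = [110.0, 190.0, 310.0]  # every prediction is off by 10
model_b = [130.0, 170.0, 330.0]  # every prediction is off by 30

scores = {"model_a": rmse(actual, model_a), "model_b": rmse(actual, model_b)}
best = min(scores, key=scores.get)  # a lower RMSE means a better fit
```

A lower RMSE means the predictions sit closer to the observed sales, so the model with the
smallest value would be preferred under this assumption.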
This is a complete exploratory analysis of the Big Mart sales data. It is a regression practice
problem in which we have to predict sales product-wise and store-wise.
CHAPTER 2
PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base and on demand prediction of
sales trends. Sales forecasting is the process of estimating future sales. Accurate sales
forecasts enable companies to make informed business decisions and predict short-term and
long-term performance. Companies can base their forecasts on past sales data, industry-wide
comparisons, and economic trends. Sales forecasts help sales teams achieve their goals by
identifying early warning signals in their sales pipeline, so that they can course-correct
before it is too late. The goal is to improve on the accuracy of the existing project so that
companies can increase their sales and profit. To improve the prediction further, an efficient
algorithm is chosen by comparing different algorithms.
The primary issues that need to be addressed are:
1. Data Deluge: Small businesses often grapple with a deluge of sales data, creating a challenge in
manually sifting through and extracting meaningful insights from the substantial volume.
2. Trend Blindness: Without a structured analysis approach, small businesses find it challenging to
discern crucial sales trends, cyclic patterns, and fluctuations that play a pivotal role in influencing
their revenue.
4. Competitive Disadvantage: Smaller enterprises that are unable to leverage the full potential
of their sales data are left at a strategic disadvantage in the market.
CHAPTER 3
LITERATURE SURVEY
➢ Shuyun Ren “Forecasting the Retail Sales of China’s Catering Industry Using Support
Vector Machines.”
The forecast of China's catering retail sales was studied in this paper. The seasonal
impact was considered in the forecasting. The retail sales were predicted using the
seasonal auto-regressive integrated moving average (ARIMA) model.
Algorithms: seasonal ARIMA, SVM.
Result: The SVM method is clearly superior to the seasonal ARIMA method in both
long-term and short-term forecasting. [2]
➢ Avinash Kumar Sharma “An Intelligent Model For Predicting the Sales of a Product.”
The approach shown in this paper is systematic, accurate and precise model building,
used to compute the current scenario and predict the future projection of a product
in the market.
Algorithms: random forest, neural network.
Best result: neural network. [3]
➢ Kaneko and Yada “A Deep Learning Approach for the Prediction of Retail Store
Sales.”
The purpose of this research is to construct a sales prediction model for retail stores
using the deep learning approach, which has gained significant attention in the rapidly
developing field of machine learning in recent years. Using such a model for analysis,
an approach to store management could be formulated.
Algorithms: logistic regression model.
Result: The accuracy decreased by around 13% when the logistic regression model was
used. [7]
CHAPTER 4
OBJECTIVES
CHAPTER 5
SYSTEM REQUIREMENTS
➢ HARDWARE REQUIREMENTS
• System : i3 Processor
• Hard Disk : 500 GB
• Monitor : 15" LED
• RAM : 4 GB
➢ SOFTWARE REQUIREMENTS
CHAPTER 6
METHODOLOGY
Sales prediction is better treated as a regression problem than as a time series problem.
Practice shows that regression approaches can often give better results than time series
techniques. Machine learning algorithms make it possible to find patterns in the time series.
The BigMart sales dataset consists of 2013 sales data for 1559 products across 10 stores in
different cities.
We have two datasets: the train dataset, which has 8523 rows and 12 columns, and the test
dataset, which has 5681 rows and 11 columns. The train dataset has one extra column, which is
the target variable. We will predict this target variable for the test dataset. Calculations
were done in the Python environment using the main packages pandas, sklearn, numpy, matplotlib,
seaborn, etc. To conduct the analysis, we use Jupyter Notebook.
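As a rough illustration of how the train and test tables relate, the sketch below builds two
tiny stand-in frames (the real files are not reproduced here, and the column names other than
Item_Outlet_Sales are assumptions) and identifies the target as the one column that the test
table lacks:

```python
import pandas as pd

# Tiny stand-ins for the real train and test tables; values are made up.
train = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_MRP": [249.8, 48.3],
    "Item_Outlet_Sales": [3735.1, 443.4],
})
test = pd.DataFrame({
    "Item_Identifier": ["FDW58", "NCN55"],
    "Item_MRP": [107.9, 234.2],
})

# The target is the column present in train but absent from test.
target_cols = [col for col in train.columns if col not in test.columns]

# Separate the features from the target for model fitting.
X_train = train.drop(columns=target_cols)
y_train = train[target_cols[0]]
```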
The goal of the BigMart sales prediction ML challenge is to build a regression model for
predicting the sales of each of the 1559 products for the following year in each of the 10
BigMart stores. The BigMart sales dataset also includes certain attributes for each product
and store. This model helps BigMart understand the properties of products and stores that play
a key role in increasing overall sales. We divided the entire analysis process into the
following five stages:
1. Exploratory data analysis (EDA)
2. Data Pre-processing
3. Feature engineering & Feature Transformation
4. Modeling
5. Hyperparameter tuning and Evaluation
1. Exploratory Data Analysis
This phase tries to identify useful information by comparing the hypotheses with the available
data. It shows that the attributes Outlet Size and Item Weight have missing values, and that
the minimum value of Item Visibility is zero, which is not possible in practice. The
establishment year of the outlets varies from 1985 to 2009. These values may not be appropriate
in this form, so we need to convert them into the age of each outlet. There are 1559 unique
products and 10 unique outlets in the dataset. The attribute Item Type contains 16 unique
values. There are two types of Item Fat Content, but some entries are misspelled: ‘regular’
instead of ‘Regular’, and ‘low fat’ or ‘LF’ instead of ‘Low Fat’.
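The checks described in this phase can be sketched with pandas on a small made-up frame. The
underscore-style column names are assumptions modelled on Item_Outlet_Sales, and the values are
illustrative only:

```python
import numpy as np
import pandas as pd

# Made-up rows that reproduce the kinds of issues described above.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.9],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    "Item_Visibility": [0.016, 0.0, 0.019, 0.0],
    "Item_Fat_Content": ["Low Fat", "regular", "LF", "Regular"],
    "Outlet_Establishment_Year": [1999, 2009, 1985, 1998],
})

# Number of missing values per column.
missing_counts = df.isnull().sum()

# Rows where visibility is zero, which is not plausible in practice.
zero_visibility = int((df["Item_Visibility"] == 0).sum())

# Distinct spellings of the fat content labels.
fat_labels = sorted(df["Item_Fat_Content"].unique())

# Range of establishment years.
year_range = (
    int(df["Outlet_Establishment_Year"].min()),
    int(df["Outlet_Establishment_Year"].max()),
)
```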
2. Data Cleaning
It was observed in the previous section that the attributes Outlet Size and Item Weight have
missing values. In our work, a missing Outlet Size is replaced by the mode of that attribute,
and a missing Item Weight is replaced by the mean of that attribute. Replacing missing values
by the mean or the mode diminishes the correlation among the imputed attributes. For our model
we assume that there is no relationship between the measured attributes and the imputed
attributes.
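A minimal sketch of this imputation, assuming the column names Item_Weight and Outlet_Size and
using made-up values:

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing weight and one missing outlet size.
df = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 20.0, 30.0],
    "Outlet_Size": ["Medium", "Small", None, "Medium"],
})

# Mean of the observed weights (missing values are skipped).
mean_weight = df["Item_Weight"].mean()

# Most frequent observed outlet size.
mode_size = df["Outlet_Size"].mode().iloc[0]

# Fill the gaps: mean for the numeric column, mode for the categorical one.
df["Item_Weight"] = df["Item_Weight"].fillna(mean_weight)
df["Outlet_Size"] = df["Outlet_Size"].fillna(mode_size)
```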
3. Feature Engineering & Feature Transformation
Some inconsistencies were observed in the dataset during the data exploration phase. This phase
resolves all of them and makes the data ready for building the appropriate model. It was
noticed that the Item Visibility attribute had zero values, which makes no practical sense. So,
the mean Item Visibility of that product is used in place of each zero value. This makes all
products likely to sell. All discrepancies in the categorical attributes are resolved by
mapping their values to the appropriate categories. In some cases, it was noticed that the fat
content property does not apply, as for non-consumables. To handle this, we create a third
category of Item Fat Content, i.e. none. In the Item Identifier attribute, the unique ID starts
with either DR, FD or NC. So, we create a new attribute Item Type New with three categories:
Foods, Drinks and Non-consumables. Finally, to determine how old a particular outlet is, we add
an additional attribute Year to the dataset.
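The transformations in this phase can be sketched as follows. The column names are assumptions,
the rows are made up, and the outlet age is computed relative to 2013 on the assumption that
the year of the sales data is the reference year:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "NCD19"],
    "Item_Visibility": [0.02, 0.0, 0.04, 0.06],
    "Item_Fat_Content": ["LF", "low fat", "regular", "Low Fat"],
    "Outlet_Establishment_Year": [1999, 2009, 1985, 1998],
})

# Replace zero visibility with the mean non-zero visibility of the same product,
# falling back to the overall mean if a product has no non-zero value.
vis = df["Item_Visibility"].replace(0, np.nan)
per_item_mean = vis.groupby(df["Item_Identifier"]).transform("mean")
df["Item_Visibility"] = vis.fillna(per_item_mean).fillna(vis.mean())

# Map the misspelled fat content labels onto the two intended categories.
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "regular": "Regular"}
)

# Derive the broad item category from the first two letters of the ID.
prefix_to_type = {"FD": "Foods", "DR": "Drinks", "NC": "Non-consumables"}
df["Item_Type_New"] = df["Item_Identifier"].str[:2].map(prefix_to_type)

# Non-consumables get the third fat content category.
df.loc[df["Item_Type_New"] == "Non-consumables", "Item_Fat_Content"] = "None"

# Outlet age in years (reference year 2013 is an assumption).
df["Year"] = 2013 - df["Outlet_Establishment_Year"]
```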
4. Model Building
After completing the previous phases, the dataset is ready for building the proposed model.
Once the model is built, it is used as a predictive model to forecast the sales of Big Mart. In
our work, we build models based on different algorithms such as Random Forest, Linear
Regression, Lasso Regression, Ridge Regression, Decision Tree, etc. and compare them with one
another. All models receive the features as input, which are then split into a training set and
a test set. The test dataset is used for sales prediction.
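A sketch of such a comparison with scikit-learn is shown below. It runs on synthetic data rather
than the Big Mart features, and the hyperparameters and the choice of RMSE as the comparison
metric are assumptions made for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared features and sales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 20 * X[:, 1] ** 2 + rng.normal(0, 1, size=300)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear regression": LinearRegression(),
    "Ridge regression": Ridge(alpha=1.0),
    "Lasso regression": Lasso(alpha=0.1),
    "Decision tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit every model on the training split and score it on the held-out split.
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

best_model = min(rmse_by_model, key=rmse_by_model.get)
```

Which model comes out best depends on the data, so no ranking should be read into this
synthetic example.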
5. Hyperparameter tuning and Evaluation
The next and final step in our project is to tune different parameters in every model and
check for improvement in model performance. While this is an important step in modeling, it is
by no means the only way to improve performance.
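The report does not say which tuning procedure was used. One common option is an exhaustive grid
search with cross-validation; the sketch below applies scikit-learn's GridSearchCV to a random
forest on synthetic data, with a small parameter grid chosen only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared features and sales.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 100 * X[:, 0] + rng.normal(0, 1, size=200)

# Candidate hyperparameter values (illustrative, not taken from the report).
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 6]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
```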
CHAPTER 7
TESTING
Algorithm                          AUC         AUC        Run-time        Maximum Memory
                                   (Training)  (Holdout)  (Training)      Utilization (of 16 GB)
XGBoost                            0.88        0.86       16 min 12 sec   12%
Random Forest (Depth controlled)   0.79        0.51       23 min 10 sec   29%
KNN (Euclidean distance)           0.52        0.5        180 min 12 sec  35%
CHAPTER 8
RESULTS
It was found that our target variable ‘Item_Outlet_Sales’ is right-skewed: most observations are
concentrated at lower sales, with a long tail towards higher sales.
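Skewness of this kind can be checked numerically. The sketch below uses made-up sales values; a
positive skewness, together with a mean above the median, indicates a long tail towards higher
sales:

```python
import pandas as pd

# Made-up sales values: many small ones and a few very large ones.
sales = pd.Series(
    [100, 120, 150, 180, 200, 250, 400, 900, 2500], name="Item_Outlet_Sales"
)

skewness = sales.skew()      # sample skewness; positive for a right-skewed series
mean_sales = sales.mean()
median_sales = sales.median()
```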
Among the current numeric variables, Item_Visibility is the feature with the lowest correlation
with our target variable: the less visible the product is in the store, the higher its sales
tend to be. This is curious, since from the initial assumptions this variable was expected to
have a high impact on increasing sales. Moreover, this feature has a negative correlation with
all of the other features. The most positive correlation belongs to Item_MRP.
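A correlation ranking of this kind can be produced with pandas. The numbers below are made up
to mimic the pattern described and are not taken from the Big Mart data:

```python
import pandas as pd

# Made-up values: sales rise with MRP and fall with visibility.
df = pd.DataFrame({
    "Item_MRP": [50, 100, 150, 200, 250],
    "Item_Visibility": [0.10, 0.08, 0.07, 0.03, 0.02],
    "Item_Outlet_Sales": [500, 1100, 1400, 2100, 2400],
})

# Pearson correlation of every feature with the target, lowest first.
corr_with_target = (
    df.corr()["Item_Outlet_Sales"].drop("Item_Outlet_Sales").sort_values()
)

lowest_feature = corr_with_target.idxmin()
highest_feature = corr_with_target.idxmax()
```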
There seems to be a low number of stores with size “High”. Most of the existing stores are
either “Small” or “Medium”. It was observed that the lowest sales were produced in the smallest
locations. However, in some cases a medium-size location produced the highest sales, even
though it was a Type 3 supermarket (there are three types of supermarket: Type 1, Type 2 and
Type 3) rather than the largest-size location. To increase the product sales of Big Mart in a
particular outlet, more locations should be switched to Type 3 supermarkets.
Looking at our results by city tier, we see that stores in Tier 2 cities present the highest
results, followed by Tier 3 cities, with Tier 1 cities showing the lowest results of the three
types of location.
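A per-tier comparison like this one is typically a grouped mean. The sketch below assumes a
column named Outlet_Location_Type and uses made-up sales values arranged to reproduce the
ordering described:

```python
import pandas as pd

# Made-up sales per outlet location tier.
df = pd.DataFrame({
    "Outlet_Location_Type": [
        "Tier 1", "Tier 1", "Tier 2", "Tier 2", "Tier 3", "Tier 3",
    ],
    "Item_Outlet_Sales": [1000, 1400, 2200, 2600, 1800, 2000],
})

# Average sales per tier, highest first.
mean_by_tier = (
    df.groupby("Outlet_Location_Type")["Item_Outlet_Sales"]
    .mean()
    .sort_values(ascending=False)
)
```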
The proposed model gives better predictions than the other models for future sales at all
locations. Item Outlet Sales is strongly correlated with Item MRP. Less visible items are sold
more than more visible items, which is not plausible in practice.
CONCLUSION
In the present era of a digitally connected world, every shop wants to know the demand for its
products and the needs of its users. Extensive research is under way at the enterprise level on
accurate sales prediction. As the profit made by a company is directly proportional to the
accuracy of its sales predictions, big marts want more accurate prediction algorithms so that
the company does not suffer any losses. In this work, we designed a predictive model by
modifying the Random Forest technique and tested it on the 2013 Big Mart dataset to predict the
sales of a product from a particular outlet. Experiments support that our technique produces
more accurate predictions than other available techniques such as decision trees, ridge
regression, etc.
REFERENCES
[1] Sunitha Cheriyan, Shaniba Ibrahim, Saju Mohanan & Susan Treesa (2018). Intelligent
Sales Prediction Using Machine Learning Techniques.
[2] Xiangsheng Xie & Gang Hu (2008). Forecasting the Retail Sales of China’s Catering
Industry.
[3] Avinash Kumar, Neha Gopal & Jatin Rajput (2020). An Intelligent Model For Predicting
the Sales of a Product.
[4] Purvika Bajaj, Renesa Ray, Shivani Shedge & Shravani Vidhate (2020). Sales
Prediction Using Machine Learning Algorithms.
[5] Ching-Seh (Mike) Wu, Pratik Patil & Saravana Gunaseelan (2018). Comparison of
Different Machine Learning Algorithms for Multiple Regression on Black Friday Sales
Data.
[6] Nikhil Sunil Elias & Seema Singh (2019). Forecasting of Walmart Sales Using
Machine Learning Algorithms.
[7] Yuta Kaneko & Katsutoshi Yada (2016). A Deep Learning Approach for the Prediction of
Retail Store Sales.
[8] Gopal Behera & Neeta Nain (2019). Sales Prediction For Big Mart.
https://www.kaggle.com/aakash2016/big-mart-sales-prediction
https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
https://medium.com/diogo-menezes-borges/project-1-bigmart-sale-prediction-fdc04f07dc1e
https://rstudio-pubs-static.s3.amazonaws.com/381886_981132516a8e437284327a405ca4d91a.html