Abstract. E-commerce is a platform where people are able to buy and sell goods. The main
purpose of e-commerce is to provide convenience to customers, who do not have to go to a
physical store to make a purchase. They are able to make the purchase online, and the item
arrives at their doorstep in the following days. In 2019, a total of $603 billion worth of sales
were made via e-commerce in the United States, compared to 3.17 billion in retail sales in the
United States. The purpose of this study was to build machine learning models which are able
to forecast the sales of an e-commerce platform. A literature review of similar systems and
similar studies related to the researcher's project was carried out. The purpose of this
literature review was to understand which machine learning models were used by other studies,
so that the researcher could select some of the best machine learning models for this study.
Once the models have been selected, the researcher builds them and tests their accuracy, error
and performance. Finally, the researcher compares the accuracy and error of all the models to
find the best model, that is, the one with low error and high accuracy for forecasting sales.
The model which fulfils these criteria will be integrated into the system being built by the
researcher. The system will give a view of the current and forecasted sales.
1. Introduction
The domain for this research is e-commerce. Introducing a new method to gather and analyze data can
have a huge impact on an organization, and the outcome can be positive or negative. E-commerce
platforms collect a large amount of data and store it in their data centers, but they fail to see this
as a business opportunity, such as analyzing the data and its patterns throughout the years. For
example, all the customer data from registration, search history, sales and chats is stored on their
servers and is only used when there is a problem with existing data. It is understandable that, due to
privacy issues, they do not want other companies to analyze their data, but they are able to create
their own teams to analyze the data, which can be profitable for them.
It has been shown that e-commerce is one of the top 10 businesses with a very large amount of stored
data. With access to this amount of data, e-commerce platforms will be able to create a game-changing
environment for the e-commerce industry. This industry has been spending hundreds of millions of
dollars on advertising, social media, sorting secured data and much more to generate more sales, but
it has not realized that with machine learning it will be able to step up its game against its
competitors. Machine learning is a broad field with many specializations such as data mining,
artificial intelligence, augmented reality and prediction. This research focuses only on prediction
using machine learning.
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
ICCPET 2020 IOP Publishing
Journal of Physics: Conference Series 1712 (2020) 012042 doi:10.1088/1742-6596/1712/1/012042
With the ability to make predictions using machine learning algorithms, e-commerce platforms will be
able to identify hidden patterns, outliers, points of interest (POI) and much more. This will allow
them to properly identify the important details in each and every aspect of the business. They will be
able to use all their data, such as the amount of product purchased, product categories, payment
method, interest rate, duration of delivery and customer location, to have a better understanding of
how to improve and manage their sales.
If an e-commerce platform is able to forecast its sales for the upcoming month or day, it will be
able to make better business decisions. It will also be able to track trends in its sales, for example
around any festival or event that happens yearly. It will be able to keep track of its inventory so
that all items are sufficiently stocked, which avoids overstocking and understocking of a product,
because it has a rough estimate of the purchases which are likely to happen. In addition, it will be
able to keep better track of its finances, make reasonable purchasing decisions and maintain a
proper budget throughout the business operation.
2. Literature Review
Sales prediction is an essential task for an e-commerce platform, and the prediction can have a
crucial impact on the business decision-making process. In addition, with sales prediction the
platform can gain a better understanding of its financial status, manage its workforce and further
improve its supply chain management system. Based on [1] and [2], sales prediction gives the
e-commerce platform more accurate and reliable forecasts, which help with inventory planning,
competitive pricing and timely promotion strategies. According to [3], the prediction of e-commerce
sales makes it possible to understand the lifecycle of the e-commerce platform, such as its sales
growth, stability and decline, and how sales are affected by short-term factors such as promotion,
pricing, season and online ranking.
In the research conducted by [1], a convolutional neural network (CNN) was used for sales
forecasting in e-commerce. This research was done to address an identified limitation, namely that
existing methods require case-by-case manual feature engineering for specific scenarios, which is
difficult, time-consuming and requires a lot of expert knowledge. The goal of the research, as
mentioned by [1], was to determine whether this approach can automatically extract effective features
and provide a sales forecast based on the extracted features. The main algorithm used to perform the
sales prediction was the CNN. For comparison purposes, the researchers chose the ARIMA, DNN, TL and
WD algorithms to find the most accurate results for the sales prediction. The researchers also used
sample weight decay and transfer learning techniques to further improve the forecasting accuracy,
which proved to be highly effective in the experiments. Based on the MST boxplot, the ARIMA model has
the highest average value, whereas the CNN achieved its goal of automatically extracting the
effective features and forecasting sales using the extracted features.
In the research conducted by [2] and [3], both studies chose neural network algorithms, but each
takes its own approach: a Nonlinear Autoregressive Neural Network (NARNN) is used by the 2018
study [3], while the 2019 study [2] uses Long Short-Term Memory (LSTM) networks, a special kind of
Recurrent Neural Network (RNN). These studies used these algorithms to predict the sales and demand
of e-commerce. The problem stated by the two studies is similar, namely the difficulty of identifying
the different cross-product demand/sales patterns and the correlations among them. The goal of both
research papers was to propose a systematic pre-processing framework to overcome challenges in
e-commerce settings
and also to propose a forecasting framework. The algorithm used as the baseline for comparison in
both of these studies was ARIMA (time series analysis). The results of the 2018 study show that the
prediction error for NARNN is 0.1016 and for ARIMA is 0.1389, which shows that NARNN has a lower
error rate than ARIMA. For the 2019 study, the results also show that LSTM has a lower mean and
median error than ARIMA.
Sales prediction is usually done using the most common method, time series analysis. Time series
analysis involves the autoregressive function, which helps with any type of prediction analysis. The
study by [4] on machine learning models for sales time series forecasting mentions that sales
prediction is a modern business intelligence method. As also mentioned by [3], the ARIMA model
performs well for prediction in time series analysis. The main problem stated by [4] is that, for
time series data, a large amount of data is required to capture the seasonality, and large
transactional sales data can have many missing values and outliers. The analysis must also take into
account many different factors which can impact the sales. The goal of this time series analysis is
to combine different time series algorithms in order to improve the prediction accuracy. Five
algorithms were selected in the research: ExtraTree, ARIMA, RandomForest, Lasso and Neural Network,
all of which were applied to time series forecasting. Based on the results of the forecasting error
testing, ExtraTree has the highest validation error compared to the rest and Neural Network has the
lowest validation error, making it one of the best algorithms for prediction.
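The autoregressive idea mentioned above can be illustrated with a minimal sketch. It assumes the simplest case, a first-order model y[t] = c + phi * y[t-1] fitted by ordinary least squares; the function names and the sales figures are hypothetical and are not taken from any of the cited studies.

```python
def fit_ar1(series):
    """Fit y[t] = c + phi * y[t-1] by ordinary least squares."""
    x = series[:-1]  # lagged values y[t-1]
    y = series[1:]   # targets y[t]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    phi = cov_xy / var_x
    c = mean_y - phi * mean_x
    return c, phi


def forecast_ar1(last_value, c, phi, steps):
    """Iterate the fitted recursion to forecast `steps` periods ahead."""
    forecasts = []
    value = last_value
    for _ in range(steps):
        value = c + phi * value
        forecasts.append(value)
    return forecasts


# Hypothetical daily sales that follow y[t] = 10 + 0.5 * y[t-1] exactly.
sales = [100.0, 60.0, 40.0, 30.0, 25.0, 22.5]
c, phi = fit_ar1(sales)
forecasts = forecast_ar1(sales[-1], c, phi, steps=2)
print(c, phi, forecasts)
```

Real sales data would not follow the recursion exactly, so the fitted coefficients would only approximate the underlying pattern.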
Another study [5] conducted research on forecasting Walmart sales using machine learning
algorithms. The key step in this research was implementing several different machine learning
algorithms on the sales data from Walmart locations all over the United States. The problem
highlighted in this research was creating a comparative analysis to find the best algorithm. The
researchers selected three different algorithms for the comparison and tested them using the Mean
Absolute Error (MAE) and the R^2 score. The goal of this research was to find the accuracy of each
algorithm using different hyperparameters for each model in order to obtain the best MAE and R^2
score. The algorithms used for this research were Random Forest, Gradient Boosting and Extremely
Randomized Trees (Extra Trees). The results of this research indicate that Random Forest is the best
algorithm, as it scored the lowest MAE (1979.4) and a high R^2 score (0.94), showing a high accuracy
compared with the others.
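As a brief illustration of the two evaluation measures named above, the following sketch computes MAE and R^2 from their standard definitions. The function names and the sample values are hypothetical and are not taken from [5].

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical weekly sales and the corresponding model predictions.
actual = [200.0, 220.0, 250.0, 210.0]
predicted = [190.0, 230.0, 240.0, 220.0]

mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(mae, r2)
```

A lower MAE and an R^2 closer to 1 both indicate predictions that track the actual sales more closely.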
Another study was conducted by [6] on sales forecasting for Amazon based on different statistical
methodologies. This research focused primarily on Amazon data and forecast future sales from the
historical data using statistical algorithms. The problem identified by the authors of this research
is how a statistical methodology can help in sales forecasting. Statistical methods are a part of the
time series analysis models [4]. The goal of this research was to conduct a sensitivity analysis on
the three methods and identify which is the most reliable, accurate and suitable approach. The better
the accuracy of a method, the better the prediction will be for the sales forecasting. Three
different approaches were used for this research: Winters’ exponential smoothing, time-series
decomposition and ARIMA. The results of this research were assessed by measuring the forecasting
error using the root mean squared error (RMSE). All of the methods have a very low forecasting error,
therefore all of the methods can be used to conduct sales forecasting for Amazon sales.
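For reference, the RMSE is conventionally defined as shown here. The symbols are assumptions introduced for illustration and do not appear in [6]: y_t is the actual sales value in period t, \hat{y}_t is the forecast for that period and n is the number of forecast periods.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2}
```

Because the errors are squared before averaging, large individual forecast errors weigh more heavily in the RMSE than in the MAE.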
The research conducted by [7] concerns car sales prediction using machine learning algorithms. This
research focuses on data about car sales and how it is derived from various sources. The main issues
identified by the researchers were getting a varied idea about how well the various criteria in the
dataset work and identifying the appropriate algorithm to use. The outcome of this research requires
applying multiple different machine learning algorithms to the sales dataset and providing a proper
analysis of the algorithms used [8]. Car sales are not independent of other variables: factors such
as car size, petrol capacity, price, height and tires are some of the features which influence the
sales of cars. The algorithm used
for this research is Random Forest. The results for the random forest show that price is the main
attribute, having the largest impact on car sales. The random forest also has a high accuracy
(above 85%).
Another study [9] discusses explaining machine learning models in sales prediction. This research
mainly discusses the main machine learning models commonly used in sales prediction and presents an
analysis of the best machine learning model available. The problem identified by this research paper
was how to identify the appropriate model based on business understanding by using intelligent and
data-driven models. The goal of this research was to demonstrate how effective and usable each model
is. This is done to ensure that the correct method is selected for the chosen business environment,
as mentioned by [10] and [4]. The methods (algorithms) used in this research were decision tree,
neural network, naïve Bayes, random forest and support vector machine (SVM). The results for the
chosen algorithms are tabulated by accuracy: random forest is at 85%, naïve Bayes at 83%, decision
tree at 76%, neural network at 70% and finally SVM is the lowest at 59%. Therefore, the best method
to choose is the random forest, as it has the highest accuracy at 85%.
The research conducted by [11] discusses machine learning for restaurant sales forecasting. It
explains how restaurants can implement machine learning to understand and improve their sales. The
problem identified is that many restaurants do not have a solid forecast of their daily sales,
because they lack proper knowledge of how to calculate a sales prediction. The goal of this research
is to investigate the possibility of creating a forecasting solution based on supervised learning.
This will help restaurant businesses to record and analyze their sales and make better financial
decisions. The algorithms used in this research are Extreme Gradient Boosting and Long Short-Term
Memory (LSTM). The results of this study were that the Extreme Gradient Boosting algorithm works very
well in this testing approach, while the LSTM has limited support for these problems.
3.4. Modelling
Modelling is the process of identifying the algorithms which are going to be used for the research
project. For this research, two machine learning algorithms will be used: Gradient Boosting and
Random Forest. These algorithms were selected because they are commonly used in prediction analysis.
Gradient boosting is a machine learning technique for classification and regression which produces a
prediction model as an ensemble of weak prediction models such as decision trees. Random Forest
consists of a large number of decision trees that work as an ensemble; each individual tree provides
a prediction, and the predictions are combined (by majority vote for classification or by averaging
for regression) into the final result.
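The idea of building a strong predictor from weak ones can be sketched as follows. This is a minimal illustration of gradient boosting for squared error using one-split decision trees (stumps) on a single feature, not the implementation used in this research; the function names, the hyperparameter values and the data are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    # The largest x is excluded so that the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: week index as the feature, weekly sales as the target.
weeks = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [10.0, 12.0, 11.0, 13.0, 30.0, 32.0, 31.0, 33.0]

model = fit_gradient_boosting(weeks, sales)
mean_sales = sum(sales) / len(sales)
mse_baseline = mean_squared_error(sales, [mean_sales] * len(sales))
mse_model = mean_squared_error(sales, [predict(model, w) for w in weeks])
print(mse_baseline, mse_model)
```

Each stump on its own is a poor predictor, but adding many of them with a small learning rate lowers the training error below that of the constant mean baseline. In practice a library implementation with full decision trees and multiple features would normally be used.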
Models based on time series analysis will also be tested. Two time series modelling methods are
going to be tested for this research: Autoregressive Integrated Moving Average (ARIMA) and Seasonal
Autoregressive Integrated Moving Average (SARIMA). These models were chosen as they are among the
best-established time series models and can provide accurate predictions.
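The "integrated" part of ARIMA refers to differencing the series to remove trend, and SARIMA additionally differences at the seasonal lag to remove a repeating seasonal pattern. The sketch below shows only this differencing step, not a full ARIMA or SARIMA fit; the function name and the quarterly figures are hypothetical.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical quarterly sales: a period-4 seasonal pattern plus a slow upward trend.
sales = [10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44]

first_diff = difference(sales, lag=1)     # ordinary differencing (ARIMA)
seasonal_diff = difference(sales, lag=4)  # seasonal differencing (SARIMA)
print(first_diff)
print(seasonal_diff)
```

In this constructed example the seasonal difference is constant, because the only change between the same quarter of consecutive years is the trend, while the ordinary first difference still shows the seasonal swings.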
4.1. Model
Train Accuracy    Test Accuracy    RMSE    MAE    R^2
5. Conclusions
In this project on using machine learning to forecast e-commerce sales, it was found that there are
many different methods of forecasting the sales of an e-commerce platform, but the researcher was
only able to focus on four algorithms which are commonly used for forecasting future sales. The
researcher was able to build and test all of the selected machine learning models. The model with the
best prediction range, where the predicted value and the actual value are almost the same, is chosen
as the best algorithm. This best algorithm will then be integrated into a web application which will
also be built by the researcher.
References
[1] Zhao, K. and Wang, C. (2017) ‘Sales Forecast in E-commerce using Convolutional Neural Network’, (August 2017). Available at:
http://arxiv.org/abs/1708.07946.
[2] Bandara, K. et al. (2019) ‘Sales Demand Forecast in E-commerce using a Long Short-Term Memory Neural Network Methodology’.
Available at: http://arxiv.org/abs/1901.04028.
[3] Li, M., Ji, S. and Liu, G. (2018) ‘Forecasting of Chinese E-Commerce Sales: An Empirical Comparison of ARIMA, Nonlinear
Autoregressive Neural Network, and a Combined ARIMA-NARNN Model’, Mathematical Problems in Engineering, 2018, pp. 1–12.
doi: 10.1155/2018/6924960.
[4] Pavlyshenko, B. (2019) ‘Machine-Learning Models for Sales Time Series Forecasting’, Data, 4(1), p. 15. doi: 10.3390/data4010015.
[5] Elias, S. and Singh, S. (2018) ‘FORECASTING of WALMART SALES using MACHINE LEARNING ALGORITHMS’
[6] YU, J. and LE, X. (2017) ‘Sales Forecast for Amazon Sales Based on Different Statistics Methodologies’, DEStech Transactions on
Economics and Management, (iceme-ebm). doi: 10.12783/dtem/iceme-ebm2016/4132.
[7] Madhuvanthi, K. et al. (2019) ‘Car sales prediction using machine learning algorithmns’, International Journal of Innovative
Technology and Exploring Engineering, 8(5), pp. 1039–1050.
[8] Xia, G. and He, Q. (2018) ‘The Research of Online Shopping Customer Churn Prediction Based on Integrated Learning’, 149(Mecae),
pp. 756–764. doi: 10.2991/mecae-18.2018.133.
[9] Bohanec, M., Kljajić Borštnar, M. and Robnik-Šikonja, M. (2017) ‘Explaining machine learning models in sales predictions’, Expert
Systems with Applications, 71(April), pp. 416–428. doi: 10.1016/j.eswa.2016.11.010.
[10] Mohammed, M., Khan, M. B. and Bashie, E. B. M. (2017) Machine learning: Algorithms and applications, Machine Learning:
Algorithms and Applications. doi: 10.1201/9781315371658.
[11] Holmberg, M. and Halldén, P. (2018) ‘Examensarbete 30 hp Maj 2018 Machine Learning for Restaurant Sales Forecast’. Available at:
http://www.teknat.uu.se/student.
[12] Brownlee, J. (2019). 11 Classical Time Series Forecasting Methods in Python (Cheat Sheet). [online] Machine Learning Mastery.
Available at: https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/ [Accessed 14 Oct. 2019].
[13] Ceriotti, M. (2019) ‘Unsupervised machine learning in atomistic simulations, between predictions and understanding’, Journal of
Chemical Physics, 150(15). doi: 10.1063/1.5091842.
[14] Data-Driven-Science (2018). Python vs R for Data Science: And the winner is... [online] Medium. Available at:
https://medium.com/@data_driven/python-vs-r-for-data-science-and-the-winner-is-3ebb1a968197 [Accessed 2 Oct. 2019].
[15] DBD, U. (2019). KDD Process/Overview. [online] Www2.cs.uregina.ca. Available at:
http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html [Accessed 6 Oct. 2019].
[16] Hyde, K. K. et al. (2019) ‘Applications of Supervised Machine Learning in Autism Spectrum Disorder Research: a Review’, Review
Journal of Autism and Developmental Disorders. Review Journal of Autism and Developmental Disorders, 6(2), pp. 128–146. doi:
10.1007/s40489-019-00158-x.
[17] Klassen, S., Weed, J. and Evans, D. (2018) ‘Semi-supervised machine learning approaches for predicting the chronology of
archaeological sites: A case study of temples from medieval angkor, Cambodia’, PLoS ONE, 13(11), pp. 1–17. doi:
10.1371/journal.pone.0205649.