Analytical Methods of Machine Learning Model For E-Commerce Sales Analysis and Prediction
Abstract— In the commercial market, e-commerce sales show a significant upward trend and have attracted many consumers. E-commerce sales forecasting plays a significant role in an organization's growth and aids in improved operation. Many studies have been conducted in the past using statistical, fundamental, and data mining techniques for better analysis and prediction of sales. However, the current scenario calls for a study that combines the available information and compares different machine learning techniques. The sole motive of this study is to analyze different machine learning models and determine which predicts sales most accurately. The research observed that the Extreme Gradient Boosting model outperformed all other models. It produced an RMSE value of 0.0004 and an Explained Variance score of 0.99. The Decision Tree algorithm also shows an exemplary result.

Keywords— Sales Prediction, Machine Learning, Boosting, Explained Variance, Extreme Gradient Boosting.

I. INTRODUCTION
E-commerce is expanding in today's financial world [1]. It brought a paradigm shift that affected both marketers and customers. The COVID-19 crisis also hastened the growth of online shopping [2]. Customers can now purchase a variety of goods from their own residence, and businesses can continue to run despite the restrictions. Moreover, the world is evolving fast, and the business sectors rely on technology to meet market demands. Corporate industries strive to satisfy customer requirements and simultaneously earn a profit on their investment. Therefore, comprehending future trends plays an important role. Sales prediction aids in estimating the upcoming sales of the firm in advance [3]. It helps in effective decision making and appropriate resource allocation. Thus, it is well suited to enhancing revenue and promoting the organization's growth.

Sales prediction anticipates future sales and provides a way to analyze the company's performance. Traditional forecasting methods relied heavily on expert employee suggestions or quantitative analysis of historical data. Machine learning approaches have been seen to outperform these strategies. Machine learning is the discipline in which a system uses algorithms to outperform humans on a task [4]. It learns from the patterns hidden in the data and observes the result. These techniques find the optimum result with minimal human intervention. Machine learning has proven beneficial in sales forecasting by providing insight into future sales. In our study, we used supervised machine learning algorithms to obtain the best prediction results.

Our research proposes a methodology to predict sales with a minimal error rate and, consequently, higher accuracy. For this purpose, eight machine learning algorithms were built and compared. Advanced boosting models were used along with regression models to obtain accurate results. The RMSE value and the explained variance score are used as the evaluation metrics to estimate the efficiency of the models.

The remainder of the paper is structured as follows. Section II reviews scientific research on sales forecasting. Section III defines the proposed methodology and the steps it follows. Section IV describes the experimental setup and the performance measures. Section V presents the results and discussion. Section VI concludes the study and outlines future possibilities, followed by the references.

II. RELATED WORK
There have been several recent works in sales prediction that use machine learning and data analytics techniques. This section provides insight into the methods adopted by researchers in sales prediction. One study suggests a unique deep learning strategy to predict stock movement [5]. Two recurrent neural networks are combined in the model using the blending ensemble learning technique, followed by a fully connected neural network. The results demonstrate that the blending ensemble deep learning model exceeds the leading prediction model currently in use. Another study proposes a hybrid network combining a Convolutional Neural Network (CNN) and Bi-directional Long Short-Term Memory (BiLSTM) to predict e-commerce sales [6]. Various types of data are normalised via feature engineering. All the
Authorized licensed use limited to: UNIVERSIDADE FEDERAL DE SAO PAULO. Downloaded on March 18,2024 at 23:16:38 UTC from IEEE Xplore. Restrictions apply.
comments are analysed using BiLSTM. With the information that feature engineering has provided, CNN is used to create predictions. Another work estimated sales using clustering model techniques [7]. Based on the review, the best-fit prediction model for the firm's marketing was suggested. Similarly, a study on Walmart sales using machine learning techniques was conducted [8]. The research focused on acquiring the best result using three main classification models. The R^2 score and Mean Absolute Error (MAE) metrics with appropriate hyperparameters were used to evaluate the algorithms. The Random Forest algorithm was chosen as the best algorithm, with a minimal error rate and a high R^2 score.

Furthermore, a study on car sales forecasting using machine learning was conducted [9]. The article concentrates on the production of vehicle sales statistics from several sources. The researchers used a random forest algorithm, and according to its results, the price was the most important feature affecting the sale of a car. The study reported an accuracy rate above 85 percent. A study on market data and sales prediction was presented in [10]. It investigates the conclusions from the experimental data and the insights gained via data visualization and mining methods. The article mainly deals with three machine learning algorithms (Generalized Linear Model, Gradient Boost, and Decision Tree), of which Gradient Boost yields the best result. Another work proposed minimizing the prediction error rate using the Extreme Gradient Boosting algorithm with the assistance of the SigOpt Bayesian Optimization technique [11].

This work proposes a prediction model that aims for high accuracy while suppressing the error rate to a very low level. Advanced machine learning models were analyzed and deployed on a large dataset. Data transformation techniques such as label encoding and feature scaling were employed to transform the data. In addition, hyperparameters were tuned to build a strong model.

III. PROPOSED METHODOLOGY
In our research, a five-tier approach is taken to address the issue of sales prediction. Firstly, the Global Superstore dataset was collected from an open database, the Kaggle Repository [12]. Its 24 attributes describe the customer, product, order, and sales details. The dataset consists of online retail store order information gathered from 147 countries. These data underwent a preliminary analysis, in which the data was visualized and interpreted. Secondly, data pre-processing is carried out, in which the data is cleaned of noise and prepared for the forthcoming stages. Thirdly, feature transformation is performed to bring the data into a more understandable and reliable form. In the fourth stage, the data is divided into training and testing sets to decrease the complexity. In the fifth stage, the divided data is fed into the different models, which are constructed and evaluated. Figure 1 shows the architecture diagram.

The significant findings of our research are the following. Our proposed approach provides a suitable model for sales prediction in an e-commerce market. A variety of statistical methods were explored and compiled to create the best model for predicting sales. Data transformation techniques such as label encoding and feature scaling were employed on the dataset to improve the result. The XGBoost regression model surpassed all other models, producing high accuracy on the test data. Advanced regression and tree models also performed well, producing good accuracy on the test data. The RMSE value was close to zero, showing a low error rate and thus high accuracy.

Fig. 1. System Architecture

A. Dataset Collection
The Global Superstore dataset was collected from the Kaggle repository [13]. Global Superstore is a collection of data from various New York-based online retailers, collected and organized in the form of a dataset. The dataset comprises Global Superstore order data gathered from 147 countries worldwide. The retail dataset consists of 51290 entries. The 24 attributes describe the customer, product, order, and sales details. The customer information includes the customer name, id, segment, city, and state. The order information covers the order date and order id. Product id, category, product name, and sub-category describe the product details. Quantity, sales, discount, and profit give an overview of the sales statistics. These attributes are used to train and construct the model. Table 1 shows the data description of the sales data set.

B. Data Pre-Processing
Data cannot be used in machine learning algorithms in its raw state because of the way it was gathered; hence, the data must be prepared before being used, and pre-processing aids in this process. Data pre-processing ensures that the features are consistent, complete, and free of errors. The steps that
make up data preparation are as follows. Firstly, the dataset and the necessary Python libraries are imported, and the Global Superstore CSV file is loaded. Based on the data, a few attributes (Order_Id, Customer_Id, Customer_Name, Row_Id, Ship_Date, Order_Date) were eliminated, as these parameters do not contribute to the final output.

Next, missing values are handled to reduce discrepancies in the data. Postal_Code contains 41296 null values in our data, which are dropped to reduce the noise.

adopted to forecast accurate results. The following section describes the algorithms used in this research.

The K-Nearest Neighbors algorithm is a supervised learning approach introduced by Fix and Hodges [15,16]. It estimates the result based on the proximity of other available samples. The closeness of data is computed using a distance function, commonly the Euclidean distance.

$$\text{Euclidean Distance} = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2} \qquad (1)$$

Equation 1 is the Euclidean distance formula, which estimates the distance between two points x and y with coordinates x_i and y_i.
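As an illustration of Equation 1, the following minimal Python sketch computes the Euclidean distance and uses it for a simple K-Nearest Neighbors regression. It is only a sketch under stated assumptions: the function names, the choice of k = 3, and the toy rows are hypothetical and are not taken from the study, which does not report its KNN settings.

```python
import math


def euclidean_distance(x, y):
    """Equation 1: square root of the summed squared coordinate differences."""
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k training rows nearest to query."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Made-up rows for illustration only: (quantity, discount) -> sales
train_X = [(1.0, 0.0), (2.0, 0.1), (3.0, 0.0), (8.0, 0.5)]
train_y = [10.0, 20.0, 30.0, 80.0]

prediction = knn_predict(train_X, train_y, query=(2.0, 0.0), k=3)
print(prediction)
```

Averaging the neighbours' targets is one common choice for KNN regression; a distance-weighted average is an equally plausible variant.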
TABLE I. DATA DESCRIPTION

Attribute        Description                            Data Type
Row_Id           Distinct Id                            int64
Order_ID         Unique identifier for each order       Object
Order_Date       The ordered date                       datetime64
Ship_Date        The shipped date                       datetime64
Ship_Mode        The shipping mode                      Object
Customer_ID      Unique identifier for each customer    Object
Customer_Name    The name of the customer               Object
Segment          The type of the consumer               Object
City             The city of the customer               Object
State            The state of the customer              Object
Country          The country of the customer            Object
Postal_Code      The postal code of the customer        float64
Market           The market of the sales                Object
Region           Region of the sales                    Object
Product_ID       Unique identifier for each product     Object
Category         Category of the product                Object
Sub_Category     The sub-category of the product        Object
Product_Name     The name of the product                Object
Sales            The sales made by the product          float64
Quantity         Quantity sold by each product          int64
Discount         Discount offered to each product       float64
Profit           The profit obtained from each sale     float64
Shipping Cost    The shipping cost of each product      float64
Order Priority   The order priority                     Object

C. Data Transformation
Data transformation is a way of converting unstructured data into a format suitable for prediction. Converting the data ensures the highest possible data quality, which is essential for proper interpretation and improves the outcome. Label encoding is performed to convert the categorical features to numerical representations, so that every attribute can be consumed by the models. Label encoding was performed on the categorical attributes Ship_Mode, City, Segment, State, Region, Country, Market, Product_ID, Category, Sub-Category, Product Name, and Order Priority. As a result, the model can learn a detailed representation of the data.

Furthermore, feature scaling is employed to convert the data to a consistent scale, which improves precision and reduces errors. It prevents features with wide value ranges from dominating the algorithm, thus attaining better outcomes. Here, normalization using min-max scaling is used to bring the features to a standard scale and a more comparable form.

Lastly, the data is split into train and test sets. The training set was used to build a model that recognized the sales pattern, while the testing set was used to evaluate the model and assess its predictive skills. 70 percent of the data was used in the training set, while 30 percent of the samples were used to assess the model.

Decision trees are supervised machine learning algorithms proposed by Quinlan [17]. A decision tree regressor is a statistical method that partitions instances based on attribute values. It begins at the root node and breaks the data into subgroups to build subtrees.

Random Forest is an efficient ensemble machine learning approach for predictive modeling [18,19]. It is described as a combination of decision trees that uses bagging to help deliver correct output. Jerome Friedman pioneered a boosting algorithm named Gradient Boosting for statistical analysis [20]. Gradient Boosting is an ensemble learning technique that integrates predictions from several decision trees to provide the result. Extreme Gradient Boosting, commonly referred to as XGBoost, is another main boosting method. In XGBoost, trees are produced in sequential order, with each tree attempting to fix the mistakes of the preceding trees. It uses a gradient-based optimization algorithm to train the trees, which makes it more efficient than traditional gradient boosting methods. It uses regularization to prevent overfitting, which can lead to better generalization performance. It allows for custom loss functions and evaluation metrics and can handle missing values and categorical features. LightGBM, or Light Gradient Boosting Machine, is a gradient boosting algorithm. Unlike many other tree-based algorithms, which grow trees level by level, LightGBM grows trees leaf by leaf. As the name implies, CatBoost is a boosting technique that accommodates categorical features in data; it handles such features on its own. The Adaptive Boosting algorithm, commonly referred to as AdaBoost, is an ensemble technique used in supervised learning. AdaBoost follows the general boosting principle.

IV. EXPERIMENT SETUP AND PERFORMANCE MEASURE
The complete dataset was divided into train and test sets to minimize the complexity. The train set is used to fit the model, whereas the test set is used to assess the fitted model. The train and test splits are 70% and 30%, respectively. Machine learning algorithms, namely Random Forest Regression, Decision Tree Regression, K-Nearest Neighbors, CatBoost, XGBoost, LightGBM, Gradient Boost, and AdaBoost, have been used to predict the sales. To achieve a good result, the models were trained using appropriate hyperparameters. Table 2 shows each model's parameters and their values.

Different metrics, namely train and test score accuracy, mean absolute error (MAE), explained variance score, and root mean squared error (RMSE), were adopted in our research.
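The label encoding, min-max scaling, and 70/30 split described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's code: the helper names, the shuffle seed, and the sample values are assumptions made for the example, and the study itself may have used library routines for these steps.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows, train_fraction=0.7, seed=4):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up sample values for illustration only
ship_mode = [
    "Second Class", "First Class", "Standard Class", "First Class", "Standard Class",
    "Second Class", "Standard Class", "First Class", "Standard Class", "Second Class",
]
sales = [120.0, 45.5, 300.0, 80.0, 15.0, 210.0, 60.0, 95.0, 150.0, 30.0]

encoded_mode, mode_codes = label_encode(ship_mode)
scaled_sales = min_max_scale(sales)

rows = list(zip(encoded_mode, scaled_sales))
train_rows, test_rows = train_test_split(rows, train_fraction=0.7)
print(mode_codes, len(train_rows), len(test_rows))
```

In a stricter setup, the scaling bounds would be computed on the training rows only and then applied to the test rows, so that no information from the test set leaks into training.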
D. Algorithms
Algorithms are the procedures used to produce models for pattern recognition. A variety of algorithms are

The Explained Variance Regression Score measures the disparity between a model and the actual data. The best score is 1.0, and lower values are worse. It is computed using Equation 2, where y is each value in the dataset, ȳ is the mean of all values in the dataset, and Var is the biased variance.

embrace the test and train data. Figure 2 depicts the test and train scores.
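The explained variance score described above, along with the RMSE and MAE metrics adopted in this study, can be computed as in the following sketch. It assumes the standard definitions of these metrics (biased variance, i.e. division by n); the function names and the toy values are hypothetical and do not reproduce the study's results.

```python
import math


def mean(values):
    return sum(values) / len(values)


def biased_variance(values):
    """Population (biased) variance: divide by n rather than n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def explained_variance(actual, predicted):
    """1 - Var(actual - predicted) / Var(actual); 1.0 is the best score."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return 1.0 - biased_variance(residuals) / biased_variance(actual)


def rmse(actual, predicted):
    """Root of the mean squared difference between predicted and actual."""
    return math.sqrt(mean([(p - a) ** 2 for a, p in zip(actual, predicted)]))


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


# Made-up values for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 9.0]

ev_score = explained_variance(actual, predicted)
rmse_value = rmse(actual, predicted)
mae_value = mae(actual, predicted)
print(ev_score, rmse_value, mae_value)
```

Note that the explained variance score ignores a constant offset in the predictions, whereas RMSE and MAE penalize it, which is one reason to report the metrics together.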
TABLE II. MODEL PARAMETERS AND VALUES

Model                                        Parameters
Gradient Boosting Regression                 learning_rate=0.3, max_depth=9, verbose=False
Adaptive Boosting Regression                 n_estimators=100
Light Gradient Boosting Machine Regression   boosting_type='gbdt', max_depth=9, learning_rate=0.5, feature_fraction=0.8, min_data_in_leaf=100, bagging_fraction=0.3, metric='rmse', random_state=100, seed=4, objective='regression', num_leaves=60
Categorical Boosting Regression              learning_rate=0.3, max_depth=9, verbose=False
Extreme Gradient Boosting Regression         learning_rate=0.5, max_depth=9, silent=1, seed=4, objective='reg:linear'

$$\text{Explained Variance} = 1 - \frac{\mathrm{Var}\{y - \hat{y}\}}{\mathrm{Var}\{y\}} \qquad (2)$$

Here ŷ denotes the predicted value.

The Root Mean Square Error (RMSE) is the standard deviation of the prediction errors. These errors describe how the data points are grouped around the regression line, so the RMSE value depends on how the errors are distributed around this line and indicates how close the actual data is to it. The RMSE value is evaluated using Equation 3, where n is the total number of values and predicted and actual denote the predicted and actual values.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{predicted}_i - \text{actual}_i)^2} \qquad (3)$$

The Mean Absolute Error (MAE) is computed using Equation 4.

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \qquad (4)$$

where n is the number of samples in the dataset, y_i is the actual value of the target variable for the i-th sample, and ŷ_i is the predicted value of the target variable for the i-th sample. Lower values of MAE indicate better model performance.

V. RESULT AND DISCUSSIONS
The train and test accuracy produced by Extreme Gradient Boosting regression were greater than those of all other models. This suggests that Extreme Gradient Boosting fits the data best. However, except for KNN, the other regression models also achieve good accuracy in both the training and test scores. This implies that the efficiency of Random Forest, Decision Tree, Gradient Boosting, CatBoost, LightGBM, and AdaBoost was equally good. KNN achieved a train score of 77% and a test score of 71%. The significant difference in these scores suggests that the model did not fully

Fig. 2. Train and Test Scores (y-axis: Score; x-axis: ML Models; series: Train Score, Test Score)

The XGBoost model's Root Mean Square Error (RMSE) is lower than that of the other models. This implies that the model achieved a reasonable prediction rate. The residual errors of the Decision Tree and Random Forest fall in a similar range. Also, CatBoost and LightGBM bring the error down to a comparable rate of about 0.002. Additionally, it is evident that KNN has the highest error rate compared to the other models, indicating a weak prediction capability. Table 3 shows the overall performance of each model using RMSE, Explained Variance Score, and MAE.

TABLE III. PERFORMANCE OF EACH MACHINE LEARNING MODEL

No.  Model                             RMSE     Explained Variance Score   MAE
3    Random Forest                     0.0006   0.99                       0.0002
4    Gradient Boosting                 0.006    0.99                       5.557e-05
5    Adaptive Boosting                 0.012    0.99                       0.009
6    Light Gradient Boosting Machine   0.002    0.99                       0.0009
7    Categorical Boosting              0.002    0.99                       0.0009
8    Extreme Gradient Boosting         0.0004   0.99                       2.492e-05

Figure 3 compares the RMSE values of the different models. The RMSE value of KNN is higher than that of all other models. XGBoost and CatBoost have the lowest RMSE when compared to the other models. The Explained Variance Score shows a value of 0.99 for XGBoost Regression. However, all models except KNN also reach an explained variance score of 0.99.

Overall, XGBoost outperforms all other models. This regression model endeavors to
acquire high accuracy in the train and test scores. Furthermore, it brought down the error to a negligible rate. Therefore, XGBoost Regression can be regarded as the pre-eminent model. It is also evident that all other models except KNN performed well. When compared to the other models, KNN underperformed. Figure 4 compares the explained variance regression score of the different models.

Fig. 3. Comparing RMSE Value of different Models

Fig. 4. Comparing Explained Variance Score of different Models (y-axis: Variance Score; x-axis: ML Models)

TABLE IV. COMPARISON OF EXISTING AND PROPOSED METHODOLOGY

Existing Study        Method                     Performance Analysis
Panjwani et al. [7]   Random Forest Classifier   83.33% Accuracy
Odegua [14]           Random Forest Algorithm    MAE 0.409178
Elias et al. [8]      Random Forest Algorithm    94% Accuracy
Proposed Study        XGBoost Regression         RMSE 0.0004, MAE 2.492e-05

Table 4 shows the comparison between the existing and proposed studies. Panjwani et al., in their research, analysed the sales trend forecast of Bigmart using machine learning algorithms. Random Forest Classifier, Linear Regression, and Decision Tree Classifier were used for this purpose. The data was divided into 70 percent and 50 percent for training and testing. Overall, the system produces an accuracy of 83 percent for the Random Forest Classifier. Odegua employed a machine learning technique to forecast sales of a supermarket chain store. K-Nearest Neighbor, Random Forest, and Gradient Boosting were employed in his study. The Random Forest algorithm outperformed all other models, producing an MAE of 0.4. Elias et al. forecasted Walmart sales using three classification models. Performance analysis was performed using the R^2 score and Mean Absolute Error (MAE) metrics. The Random Forest algorithm obtained 94 percent accuracy. In our research, the XGBoost regression model beat the other models by delivering high train and test scores while lowering the RMSE value to 0.0004 and the MAE to 2.492e-05.

VI. CONCLUSION AND FUTURE SCOPE
Sales forecasting plays an important role in the commercial world. It aids in anticipating the revenue and profit of an organization and is an effective way to increase income and boost company growth. In this study, we developed a methodology to examine sales prediction. We proposed an approach that uses machine learning algorithms to prevent overstocking and lower product wastage. For this purpose, eight different algorithms were trained and employed for prediction: KNN, Decision Tree Regression, Random Forest Regression, and ensemble boosting methods, namely Gradient Boost, XGBoost, Light Gradient Boost, AdaBoost, and Categorical Boost. The proposed method considers the performance of each regression model and computes the model efficiency using different evaluation metrics. It was observed that XGBoost Regression surpasses all other models, achieving the highest accuracy of all the algorithms. It was also evident that, except for KNN, all other algorithms performed well.

For future enhancement, to mitigate trade loss, inconsistency, and periodic fluctuations, more targeted and efficient tactics must be implemented to maximize profit and remain competitive. For the prediction, we intend to use deep learning techniques. Highly efficient and advanced neural network approaches could be implemented to examine the forecasting performance. Further evaluations can be performed in the future, and the bestselling products and a price optimization strategy will be suggested.

REFERENCES
[1] Hendra, E. S. Rini, P. Ginting and B. K. F. Sembiring, "Impact of eCommerce service quality, recovery service quality, and satisfaction in Indonesia," 2017 International Conference on Sustainable Information Engineering and Technology (SIET), 2017, pp. 35-40, doi: 10.1109/SIET.2017.8304105.
[2] C. Zhan, C. K. Tse, Y. Gao and T. Hao, "Comparative Study of COVID-19 Pandemic Progressions in 175 Regions in Australia, Canada, Italy, Japan, Spain, U.K. and USA Using a Novel Model That Considers Testing Capacity and Deficiency in Confirming Infected Cases," in IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 8, pp. 2836-2847, Aug. 2021, doi: 10.1109/JBHI.2021.3089577.
[3] K. Rebane, M. Teichmann and K. Rannat, "Dynamics of the Public Satisfaction with Situation Management During COVID-19 Pandemic: Developments from March 2020 to January 2022," 2022 IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA), 2022, pp. 112-114, doi: 10.1109/CogSIMA54611.2022.9830670.
[4] Q. Shen, "A machine learning approach to predict the result of League of Legends," 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE), 2022, pp. 38-45, doi: 10.1109/MLKE55170.2022.00013.
[5] Li, Y., Pan, Y. A novel ensemble deep learning model for stock prediction based on stock prices and news. Int J Data Sci Anal 13, 139–149 (2022).
[6] H. Zhu, "A Deep Learning Based Hybrid Model for Sales Prediction of E-commerce with Sentiment Analysis," 2021 2nd International Conference on Computing and Data Science (CDS), Stanford, CA, USA, 2021, pp. 493-497, doi: 10.1109/CDS52072.2021.00091.
[7] Panjwani, Mansi, Rahul Ramrakhiani, Hitesh Jumnani, Krishna Zanwar, and Rupali Hande. Sales Prediction System Using Machine Learning. No. 3243. EasyChair, 2020.
[8] N. Elias and S. Singh, "Forecasting of Walmart sales using machine learning algorithms," Research paper, Dept. of Electronics & Comm. Engineering, BMS Inst. of Technology & Management, Bangalore, India, 2018.
[9] Madhuvanthi, K. et al. (2019) 'Car sales prediction using machine learning algorithms', International Journal of Innovative Technology and Exploring Engineering, 8(5), pp. 1039–1050.
[10] Cheriyan, Sunitha, Shaniba Ibrahim, Saju Mohanan, and Susan Treesa. "Intelligent Sales Prediction Using Machine Learning Techniques." In 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE), pp. 53-58. IEEE, 2018.
[11] M. Korolev and K. Ruegg, "Gradient boosted trees to predict store sales," Personal communication, 2015.
[12] Tableau.com. Online: https://www.tableau.com/sites/default/files/training/global_superstore.zip. (accessed Aug. 29, 2022).
[13] Galton, F. (1886). "Regression towards mediocrity in hereditary stature". The Journal of the Anthropological Institute of Great Britain and Ireland. 15: 246–263.
[14] R. Odegua, "Applied Machine Learning for Supermarket Sales Prediction," Project: Predictive Machine Learning in Industry, 2020.
[15] Fix, Evelyn; Hodges, Joseph L. (1951). Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties (PDF) (Report). USAF School of Aviation Medicine, Randolph Field, Texas.
[16] Y. Li, J. Shi, F. Cao and A. Cui, "Product Reviews Analysis of E-commerce Platform Based on Logistic-ARMA Model," 2021 IEEE International Conference on Power, Intelligent Computing and Systems (ICPICS), 2021, pp. 714-717, doi: 10.1109/ICPICS52425.2021.9524238.
[17] Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1):81–106.
[18] Christoph Reinders, Bodo Rosenhahn, Learning convolutional neural networks for object detection with very little training data, (2019).
[19] Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
[20] Friedman, J. H. (2001). Greedy function approximation: a Gradient boosting machine. Annals of statistics, pages 1189–1232.