


Journal of Ambient Intelligence and Humanized Computing
https://doi.org/10.1007/s12652-022-03837-6

ORIGINAL RESEARCH

Customer purchasing behavior prediction using machine learning classification techniques

Gyanendra Chaubey¹ · Prathamesh Rajendra Gavhane² · Dhananjay Bisen³ · Siddhartha Kumar Arjaria¹

Received: 20 August 2020 / Accepted: 28 March 2022


© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022

Abstract
Many sales and service-providing companies need to reach out to the right customers when launching new products, services, and updated versions of existing products. While doing so, they need to target their existing customers. The behavior of these customers gives companies information about how to sell products. This paper presents a comparative study of different machine learning techniques that have been applied to the problem of customer purchasing behavior prediction. Experiments are done using supervised classification machine learning techniques such as logistic regression, decision tree, k-nearest neighbors (KNN), Naïve Bayes, SVM, random forest, stochastic gradient descent (SGD), ANN, AdaBoost, XgBoost, and a dummy classifier, as well as some hybrid algorithms that use stacking, namely SvmAda, RfAda, and KnnSgd. Models are evaluated using the cross-validation technique. Furthermore, the confusion matrix and ROC curve are used to calculate the accuracy of each model. Finally, the best classifier is a hybrid classifier using the ensemble stacking technique (KnnSgd), with an accuracy of 92.42%. KnnSgd gives the highest accuracy with the maximum number of features because the errors of the KNN and SGD base models are minimized by the KNN meta model at the end.

Keywords  Machine learning · Supervised classification algorithms · Customer purchasing behavior prediction ·
E-commerce

* Gyanendra Chaubey (corresponding author)
1 Department of Information Technology, Rajkiya Engineering College, Atarra, Banda 210201, UP, India
2 Dr. D Y Patil School of Engineering and Technology, Charoli (BK), via Lohegaon, Pune 412105, Maharashtra, India
3 Department of Information Technology, Madhav Institute of Technology and Science, Gwalior 474005, MP, India

1 Introduction

The customer is one of the most important assets and is the foundation for any business to be successful. The majority of startups and competitive companies invest in maintaining good relationships with customers. Customer relationship management (CRM) is a plan of action to manage, build and consolidate relationships with customers (Gupta and Aggarwal 2012). The data of customers are stored to achieve this goal. Because of this, companies can analyze the purchasing behaviors of customers and are able to make the best strategies to sell products (Dawood et al. 2019).

In today's modern era, online shopping is drastically growing, and customers are connected through digital media (Zhao et al. 2019). Many e-commerce websites focus on attracting customers on various social media platforms. Usually, the customer is a one-deal hunter and will not buy products again. Companies even have several strategies for advancing their products and services, but the costs will be very high for a company to advertise to new buyers (Charanasomboon and Viyanon 2019). This will reduce the revenue sources. To alleviate this problem, it is necessary to discover potential repeat buyers. To make the promotion more effective, companies have started focusing on loyal customers who are more likely to buy their products and services. Reaching out to loyal customers can be done by studying their purchasing behavior (Adebola Orogun 2019).

Machine learning is effective at identifying the purchasing behaviors of customers. According to available data,
machine learning algorithms examine the patterns of customers and provide target results. This makes it easy for committees to form marketing strategies. Machine learning algorithms are immensely efficient at predicting the purchasing behaviors of customers. These algorithms include random forest, artificial neural network, support vector machine, K-nearest neighbor, naive Bayes, logistic regression, dummy classifier, AdaBoost, XgBoost, stochastic gradient descent, and hybrid algorithms (Kachamas et al. 2019).

This paper aims to build a constructive model for predicting the purchasing behavior of a customer. Various machine learning techniques have been compared. In this work, data has been collected from the UCI Machine Learning Repository.

The main contributions of this paper are as follows:

1. This work helps develop an understanding of customer purchasing behavior analysis through a deep understanding of machine learning algorithms; it also provides steps to solve a machine learning problem.
2. This paper presents a comparative study between different supervised machine learning techniques, including logistic regression, decision tree, KNN, naïve Bayes, SVM, random forest, stochastic gradient descent, ANN, AdaBoost, XgBoost, dummy classifier, and some hybrid algorithms using stacking techniques, including SvmAda, RfAda, and KnnSgd, which are applied to the problem of customer purchasing behavior prediction.
3. Models are evaluated using the cross-validation technique as well as the confusion matrix, and Receiver Operating Characteristic (ROC) curves are drawn to calculate the accuracy of the models. The analysis gives an improvement in the results for classification of hotel, restaurant, and café (HoReCa) and Retail customer datasets.
4. Finally, this study indicates that the best classifier is a hybrid classifier using the ensemble stacking technique (KnnSgd), with an accuracy of 92.42%. KnnSgd yields the highest accuracy with the maximum number of features because the errors of the KNN and SGD are minimized by the KNN at the end.

This paper is organized into six sections. The remainder of the paper is organized as follows: Sect. 2 contains the related published works. Section 3 presents a detailed description of the research procedure. The analysis of the results is presented in Sect. 4, along with proof of the accuracies using different evaluation matrices. Section 5 concludes the paper, revisits the aims of this paper according to the results and observations, and discusses the future of customer purchasing behavior analysis. The references are provided in Sect. 6.

2 Related works

Customer purchasing behavior prediction is an important issue that has been discussed by many researchers, as the data of customers is now recorded by companies. Many approaches have been applied to predict customer purchasing behavior. Most of them used machine learning and data mining techniques. Some of the recent work in this area is presented in the following paragraphs.

Cardoso (2012) presented studies to predict customer purchasing behavior using logical discriminant analysis. The maximum accuracy achieved was 90.9% (using a decision tree). In this view, Das (2015) built a prediction model to identify which customers respond to a company's offers. Other studies have used algorithms such as naive Bayes, K-nearest neighbor, and support vector machine. The results show that the naive Bayes algorithm has the highest accuracy (95%). Furthermore, Vafeiadis et al. (2015) used ANN, SVM, decision tree, naïve Bayes, and logistic regression algorithms, along with their boosted versions, to predict customer churn. This was done based on an understanding of the previous purchasing behavior of customers. Overall, the best classifier was the boosted SVM (SVM-POLY with AdaBoost), with an accuracy of 97%.

Similarly, Santharam and Krishnan (2018) presented a review of customer churn prediction. In this paper, 18 modeling techniques were discussed in various industries. Telecommunication is one industry where customer purchasing behavior has been studied in depth. Momin et al. (2019) presented studies that aimed to accurately predict customer churn. Different algorithms like logistic regression, naïve Bayes, random forest, decision tree, K-nearest neighbor, and ANN were compared. ANN was the best model, with 82.83% accuracy. Another rapidly growing industry is e-commerce. Many researchers have worked in this field.

In this context, Kachamas et al. (2019) proposed an analytic tool for online vendors to predict the behavior of patrons according to the attention, interest, search, action, and share (AISAS) model. The classification model was built using the naive Bayes method. The model achieved an accuracy of more than 86%. Most customers who buy products on an online platform are one-time deal hunters. To find loyal customers, Charanasomboon and Viyanon (2019) presented a solution for repeat buyer prediction. Random forest, gradient boost, XgBoost, and leave-one-out were used to create the model. A random forest regressor with the help of leave-one-offer-out had better performance than the others. Recently, Do and Trang (2020) also presented studies predicting the purchase decisions of customers. Decision tree, multilayer perceptron, naive Bayes, radial basis function (RBF), and SVM were used to analyze the sample data.
The results indicated that the decision tree outperformed the other methods, achieving an accuracy of 91.67%.

Furthermore, Ullah et al. (2019) proposed a prediction model to identify churn customers, along with the factors underlying the churning of customers. Different classification methods were used to classify churn customers, among which a random forest (RF) algorithm performed well, with 88% accuracy. Amin et al. (2018) used a just-in-time customer churn prediction (JIT-CCP) model to provide a comparison and study the effect of state-of-the-art data transformation. They conducted three experiments to calculate the performance of the JIT-CCP model. First, JIT-CCP without data transformation achieved 49.9% accuracy, JIT-CCP with the long method achieved 49.1% accuracy, and finally JIT-CCP with the rank method achieved 59.3% accuracy.

Alloghani et al. (2018) discussed the applications of machine learning in software engineering learning and in predicting the performance of students. They used seven techniques and two datasets. The neural network gave the best accuracy for the first dataset, and random forest gave the best accuracy for the second dataset. Likewise, Liu et al. (2018) addressed the problem of data sparsity and studied temporal factors systematically. In this work, POI prediction was done to combine temporal and spatial factors in a Euclidean space. On this basis, a DME-TS model was proposed. It was based on metric embedding and constructed a model that was more flexible for various contextual factors. Finally, the authors evaluated performance against other models (e.g., BPR-MF, PRME-G, WMF, WWO, and ST-RNN) and observed improved results on benchmark datasets such as Foursquare and Gowalla.

After that, a machine learning-based method was proposed for conserving the position confidentiality of roaming PBS users. The authors identified user position by combining decision tree and K-nearest neighbor learning techniques. In this method, the hidden Markov model was applied to estimate users' destinations, including their position track sequence. Finally, this work achieved 90% position confidentiality in PBS (Sangaiah et al. 2019).

In 2018, researchers proposed a class-specific extreme learning machine (CS-ELM) to handle the binary class imbalance problem using class-specific regularization parameters. This method yielded a lower computational overhead and higher accuracy than a previous weighted extreme learning machine model on 38 different datasets taken from the KEEL dataset repository (Raghuwanshi and Shukla 2018).

Based on previous works, this paper describes every algorithm in depth, along with the geometrical and mathematical intuition of machine learning classification techniques. In this work, customer behavior has been analyzed as either Retail or hotel, restaurant, and cafe (HoReCa).

3 Proposed procedure

The proposed work is done using the wholesale customers' data from the UCI Irvine machine learning data repository. The dataset has eight attributes, including channel (HoReCa and Retail customers) as the target class. The total number of instances is 440 (298 HoReCa customers and 142 Retail customers) (Cardoso 2014). Various classification methods are used to predict the class of the customers. The whole procedure of analysis, shown in Fig. 1, is divided into the following steps:

1. Gathering the data
2. Preprocessing the data
3. Feature selection
4. Fitting the model
5. Measuring the model's accuracy

Fig. 1 Procedure of analysis

3.1 Data preprocessing

Preprocessing is a step in data science to clean, transform and reduce the data to better fit the model. There are various methods of data preprocessing:
1. Data cleaning
2. Missing data
3. Noisy data
4. Data transformation
5. Normalization
6. Attribute selection
7. Discretization
8. Data reduction
9. Aggregation
10. Dimensionality reduction

In this work, standard scaling has been done to transform the data. The standard score of a sample x is given as:

$z = \frac{x - \mu}{s}$

where $\mu$ is the mean of the samples and $s$ is the standard deviation.
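The scaling step above can be reproduced with scikit-learn's StandardScaler. The following is a minimal sketch; the file name and variable names are illustrative and assume the UCI wholesale customers CSV has been downloaded locally.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("Wholesale customers data.csv")   # assumed local copy of the UCI file

    # Spending attributes of the wholesale customers dataset; 'Channel' is the target class.
    feature_cols = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicassen"]
    X = df[feature_cols]
    y = df["Channel"]

    scaler = StandardScaler()            # applies z = (x - mean) / standard deviation per column
    X_scaled = scaler.fit_transform(X)   # NumPy array of standardized features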
3.2 Feature selection

Feature selection is a method to select the best features that can contribute the maximum in training the model. The methods that are used in feature selection are:

1. Filter method
2. Pearson correlation
3. Linear discriminant analysis
4. Analysis of variance (ANOVA)
5. Chi-square test
6. Wrapper method
7. Forward selection
8. Backward elimination
9. Recursive feature elimination
10. Embedded method

In the present work, a chi-square test is used for feature selection. The general formula for the chi-square test for categorical data is:

$X^2 = \sum \frac{(O - E)^2}{E}$

where:
O = observed frequencies
E = expected frequencies

For numerical data, the formula is

$X^2 = \frac{(n - 1)s^2}{\sigma^2}$

where:
n = sample size
s = sample standard deviation
$\sigma$ = population standard deviation
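As a sketch of how this selection step can be carried out with scikit-learn (the paper only states that a chi-square test is used, so the exact call shown here is an assumption), SelectKBest with the chi2 score function ranks the features against the Channel label. Note that chi2 expects non-negative inputs, so it is applied to the raw (unscaled) spending values from the earlier sketch.

    from sklearn.feature_selection import SelectKBest, chi2

    selector = SelectKBest(score_func=chi2, k=2)        # keep the two highest-scoring features
    X_best = selector.fit_transform(X, y)               # X, y as loaded in the earlier sketch
    print(dict(zip(feature_cols, selector.scores_)))    # higher score = stronger dependence on y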

3.3 Fitting the model

The detailed steps of the machine learning classification techniques used in the proposed work are discussed here.

3.3.1 Logistic regression

To understand logistic regression algorithms, one needs to focus on two important aspects of these algorithms:

1. Geometric intuition
2. Mathematical intuition

1. Geometric intuition: The logistic regression classification classifies data using a plane. This classification technique is applied where the points are linearly separable. Hence, logistic regression is applied for binary classification problems (Chaubey et al. 2020). Imagine a plot, as shown in Fig. 2, with nine stars and nine triangles. A plane is drawn passing through (0, 0). The classification plane classifies the triangles and stars with two misclassification points for each class.

Fig. 2 Representation of the plane of a logistic regression

2. Mathematical intuition: The mathematics underlying the line drawn through the features $(x_1, x_2, x_3, \ldots)$ and the output variable Y is given as

$y = mx + c$, or $\phi(x) = \theta_0 + \theta_1 x$, or $y(x) = w^T x + c$, or $y(\beta) = \beta_0 + \beta_1 x$   (1)

Since the line passes through (0, 0), Eq. (1) is converted into
$y(x) = w^T x$   (2)

Eq. (2) can also be understood as the distance between a point and a plane, which is given as

$d = \frac{w^T x + b}{\|w\|}$   (3)

Since the plane passes through (0, 0) and because the plane normal is a unit vector (i.e., $\|w\| = 1$), Eq. (3) can be represented as $d = w^T x$.

If there are multiple points above and below the plane, the points above the plane are considered positive points, while those below the plane are considered negative points. Hence, the collective distance of all the points is given as

$y = \sum_i w_i^T x_i$

Assume that $y_+ = +1$ (for points above the plane) and $y_- = -1$ (for points below the plane). Now, consider Fig. 2, for which there are three different cases:

Case 1: In the first case, the (positive) point is a star (y = +1) lying above the plane. Therefore, the distance of the point is also positive. Thus, the cost function will be positive, as shown in the equations below.

$y = +1$, $w^T x > 0$, $y \cdot w^T x > 0$

This shows that the point is correctly classified.

Case 2: In the second case, the (negative) point is a triangle (y = -1) lying above the plane; thus, the distance of the point is positive. Hence, the cost function will be negative, as shown in the equations below.

$y = -1$, $w^T x > 0$, $y \cdot w^T x < 0$

This shows that the point is not correctly classified.

Case 3: In the third case, the (negative) point is a triangle (y = -1) lying below the plane. Hence, the distance of the point is negative. Thus, the cost function is positive, as shown in the equations below.

$y = -1$, $w^T x < 0$, $y \cdot w^T x > 0$

This shows that the point is correctly classified.

From the above three cases, it can be concluded that obtaining a correctly classified point maximizes the cost function:

$\max \sum_i \left( y_i \cdot w_i^T x_i \right)$

where $y_i$ and $x_i$ are given and $w_i^T$ is updated as the weight. Many hypothesis planes can classify the data; which is best depends on the maximum value of the cost function.

The plane of the hypothesis varies with the outliers. Sometimes, due to outliers, even the best-fit plane is skipped and the worst fit is selected. The cost function is passed through a sigmoid function to remove the effects of outliers:

$f\left( \sum_i y_i \cdot w_i^T x_i \right)$

where f is the sigmoid function. Now, let

$z = \sum_i y_i \cdot w_i^T x_i$

Then, the sigmoid function is

$h(x) = \frac{1}{1 + e^{-z}}$

The mutual probability of prediction of a class is given as

$p(y \mid x) = h(x)^y (1 - h(x))^{1-y}$   (4)

When y = 1 in Eq. (4), $p(y = 1 \mid x) = h(x)$. When y = 0 in Eq. (4), $p(y = 0 \mid x) = 1 - h(x)$.

Our concern is estimating the probability of y given x, parameterized by $w^T$, that is, $p_y(x; w^T)$ of $p(y|x)$. So, to calculate the optimized value of $w^T$, the first step is to calculate the log-likelihood of $P(y|x)$ (Dreiseitl and Ohno-Machado 2002), which is calculated as
$l(w^T) = \sum_i^m y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))$

Then, the stochastic gradient descent update for (x, y) is calculated as

$w^T = w^T + \alpha \Delta_{w^T} l(w^T)$   (5)

When performing the partial differentiation of $l(w^T)$, $\Delta_{w^T} l(w^T)$ is given as

$\Delta_{w^T} l(w^T) = (y - h_{w^T}(x)) x_i$

Hence, putting the value of $\Delta_{w^T} l(w^T)$ into Eq. (5), the updated weight is given as

$w_i^T = w_i^T + \alpha (y - h_{w^T}(x)) x_i$

where w is updated until the cost function is maximized. In the proposed work, a logistic regression classifier is imported from the linear models of the sklearn library.
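The corresponding scikit-learn call is shown below as a minimal sketch. X_train and y_train stand for the 70/30 training split of the standardized wholesale features described in Sect. 4; the variable names and hyperparameters are illustrative rather than taken from the paper.

    from sklearn.linear_model import LogisticRegression

    log_reg = LogisticRegression(max_iter=1000)   # sigmoid-based linear classifier
    log_reg.fit(X_train, y_train)                 # learns the weight vector w
    y_pred = log_reg.predict(X_test)              # predicted HoReCa/Retail channel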
3.3.2 Decision tree

The decision tree is an easy-to-use technique employed to reach conclusions following specific conditions. A decision tree contains two types of nodes (Rokach and Maimon 2005):

1. Decision nodes and
2. Leaf nodes

The decision tree uses a top-down approach. The decision node denotes which attribute has to be selected, and the leaf node denotes the class. The primary decision node is known as the root node. Each decision node is selected based on two popular methods (Hehn et al. 2019):

1. Information gain method
2. Gini index method

A small dummy example, given in Table 1, will be considered to explain both methods. Here, the target classes of the variables are HoReCa and Retail.

Table 1 Example data for a decision tree

Ex. | Purchasing | Revisiting | Banking | Target
D1 | High | Less | Increased | HoReCa
D2 | High | Less | Medium | Retail
D3 | Low | More | Medium | Retail
D4 | High | Less | Medium | HoReCa
D5 | Low | Less | Increased | Retail

1. Information gain method: Let us first calculate the entropy of the target variable. The entropy is given as:

$E(S) = -\sum_i p_i \log_2 p_i$   (6)

After expanding Eq. (6) for the binary class, it is given as

$E(S) = -p_- \log_2 p_- - p_+ \log_2 p_+$

Hence, in the example above, HoReCa (h) = 2, Retail (r) = 3, and the entropy is

$E(S) = -p_h \log_2 p_h - p_r \log_2 p_r = -\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5} = 0.970$

For further analysis, the transformations of the HoReCa class to a positive class and the Retail class to a negative class are done. Now, the information gain of each attribute is calculated. The information gain of a sample S over an attribute A is given by

$Gain(S, A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$

Let us calculate it for all three attributes:

$Gain(S, Purchasing) = Entropy(S) - \frac{|S_{High}|}{|S|} Entropy(S_{High}) - \frac{|S_{Low}|}{|S|} Entropy(S_{Low})$

$Entropy(S_{High}) = 0.918$, $Entropy(S_{Low}) = 0$

$Gain(S, Purchasing) = 0.970 - \frac{3}{5} \times 0.918 - \frac{2}{5} \times 0 = 0.419$

Similarly,
$Gain(S, Revisiting) = 0.170$
$Gain(S, Banking) = 0.549$

Since Gain(S, Banking) is the highest, Banking is selected as the root attribute, as shown in Fig. 3.

Fig. 3 Drawing the root node

Now, for choosing the other attributes of the decision node on the Banking 'Increased' side, calculate the following:

$Gain(S_{Increased}, Purchasing) = 1.97$
$Gain(S_{Increased}, Revisiting) = 1.72$

Choosing the other attributes of the decision node on the Banking 'Medium' side, calculate the following:

$Gain(S_{Medium}, Purchasing) = 1.88$
$Gain(S_{Medium}, Revisiting) = 1.63$

Since the gain is high for Purchasing for both values of Banking, the tree can be drawn as shown in Fig. 4. The algorithm applied to draw this tree is the ID3 algorithm. Similarly, in the proposed work, the decision tree is drawn and evaluated using the sklearn library.

Fig. 4 Decision tree for the data

2. Gini index method: The Gini index is also used to calculate the gain using Eq. (7) and draw the tree. The Gini value on a node is calculated as

$Gini_{node}(N) = 1 - \sum_{t \in target\ values} (p(t))^2$   (7)

The calculation of the Gini index over a split on attribute A is done as follows:

$Gini_{node}(A) = \sum_{f \in featureValues} \frac{|S_f|}{|S|} Gini(N_f)$
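In scikit-learn both node-selection criteria are available through a single parameter, so a minimal sketch of the decision-tree model used in this work could look as follows (the hyperparameters are assumptions; the paper does not report the exact settings):

    from sklearn.tree import DecisionTreeClassifier

    # criterion="entropy" mirrors the ID3-style information gain above;
    # criterion="gini" switches to the Gini index method.
    dt = DecisionTreeClassifier(criterion="entropy", random_state=0)
    dt.fit(X_train, y_train)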

3.3.3 K-nearest neighbor

The K-nearest neighbor classifier is called an 'instance-based learner' as it stores instances. The K-nearest neighbor classifier's learning is based on the analogy that it stores the training data. When a query point is given, it searches for the K nearest neighbors closest to the query point and assigns the query point the class given by the majority class votes of the neighbors. Since it waits for the query point, it is also called a 'lazy learner'. The value of 'K' indicates the number of neighbors taken into account (Adeniyi et al. 2016; Zhang 2016).

The formulae used to calculate the distance of the nearest neighbors are:

1. Euclidean distance

$\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$   (8)

2. Manhattan distance

$\sum_{i=1}^{n} |x_i - y_i|$

3. Minkowski distance

$\left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

In the above equation, p can be any real value, though it is typically set between 1 and 2. For values of p less than 1, the formula does not define a valid distance metric.

4. Hamming distance: Hamming distance is used in the case of categorical values:

$D_H = \sum_{i=1}^{n} |x_i - y_i|$
where

x = y → D = 0
x ≠ y → D = 1

In this work, Euclidean distance is used. An example is provided here to explain the KNN algorithm better. Suppose the data in Table 2 is fitted to the model. In this case, F1 and F2 are features, and Y is the target class, with classes of 0 and 1.

Table 2 Example data for KNN

Instances | F1 | F2 | Y
I1 | 6 | 4 | 1
I2 | 2 | 3 | 0
I3 | 4 | 5 | 1
I4 | 3 | 6 | 0
I5 | 5 | 8 | 1

Suppose an unknown query point [2, 7] and K = 3. Now calculate the distance, using Eq. (8), of all the points from the query point, taking the distance of each point as I1, I2, I3, I4, and I5. Calculate I1 first:

$I1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} = \sqrt{(6 - 2)^2 + (4 - 7)^2} = \sqrt{25} = 5$

Similarly, calculate the other distances:

$I2 = 4$
$I3 = 2\sqrt{2} = 2.82$
$I4 = \sqrt{2} = 1.41$
$I5 = \sqrt{10} = 3.16$

The three nearest points are I4, I3, and I5. The classes of the points I3, I4, and I5 are 1, 0, and 1, respectively. Here class 1 makes up the majority. Hence, the query point is predicted as class label 1. Similarly, the K-nearest neighbor classifier of the sklearn library is used to make predictions in the proposed work.
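The worked example above can be checked directly with scikit-learn's KNeighborsClassifier; the snippet below is a self-contained sketch using the toy data of Table 2 (not the wholesale dataset).

    from sklearn.neighbors import KNeighborsClassifier

    X_toy = [[6, 4], [2, 3], [4, 5], [3, 6], [5, 8]]   # F1, F2 from Table 2
    y_toy = [1, 0, 1, 0, 1]

    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    knn.fit(X_toy, y_toy)
    print(knn.predict([[2, 7]]))   # -> [1], the majority class of I4, I3 and I5

The same classifier, fitted on the wholesale training split instead of the toy data, is the KNN model evaluated in Sect. 4.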
3.3.4 Naïve Bayes

Naïve Bayes is a classification technique that uses the Bayesian theorem to predict the class of a new feature set (Sánchez-Franco et al. 2019). The Bayesian formula is given as

$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$

This implies that

$P(Y \mid X) \propto P(X \mid Y) \cdot P(Y) = P(X_1 \ldots X_n \mid Y) \cdot P(Y)$

In the above equation, the difficulty is in calculating the joint probability. Hence, the Naïve Bayes classification is used. It assumes that all the features are conditionally independent (Kaviani and Dhotre 2017):
$P(X_1 \ldots X_n \mid Y) = P(X_1 \mid Y) P(X_2 \mid Y) \ldots P(X_n \mid Y)$

Hence, the probability for $X_{new} = <X_1 \ldots X_n>$ is calculated as

$P(Y \mid X_1 \ldots X_n) = P(Y) \cdot \prod_i P(X_i \mid Y)$   (9)

Now, the example presented in Table 3 is considered to explain the underlying mathematics better. Purchasing, Revisiting, and Banking are the features, and Client is the target.

Table 3 Example data for Naïve Bayes

Purchasing | Revisiting | Banking | Client
High | Less | Increased | HoReCa
High | Less | Medium | Retail
Low | More | Medium | Retail
High | Less | Medium | HoReCa
Low | Less | Increased | Retail
Low | More | Increased | HoReCa
Low | More | Medium | Retail
High | Less | Medium | Retail
High | More | Increased | HoReCa
Low | Less | Increased | HoReCa

When the data is given to the model, it learns the probability of each feature, as shown in Tables 4, 5 and 6.

Table 4 Probability table for Purchasing

Purchasing | P(· | Client = HoReCa) | P(· | Client = Retail)
High | 3/5 | 2/5
Low | 2/5 | 3/5

Table 5 Probability table for Revisiting

Revisiting | P(· | Client = HoReCa) | P(· | Client = Retail)
More | 2/5 | 2/5
Less | 3/5 | 3/5

Table 6 Probability table for Banking

Banking | P(· | Client = HoReCa) | P(· | Client = Retail)
Increased | 4/5 | 1/5
Medium | 1/5 | 4/5

P(Client = HoReCa) = 5/10
P(Client = Retail) = 5/10

Now, a new tuple is given as

$X_{new}$ = (Purchasing = High, Revisiting = More, Banking = Medium)

Now, predict the label for $X_{new}$ (see the tables of the learning phase):

P(Purchasing = High | Client = HoReCa) = 3/5
P(Purchasing = High | Client = Retail) = 2/5
P(Revisiting = More | Client = HoReCa) = 2/5
P(Revisiting = More | Client = Retail) = 2/5
P(Banking = Medium | Client = HoReCa) = 1/5
P(Banking = Medium | Client = Retail) = 4/5
P(Client = HoReCa) = 5/10
P(Client = Retail) = 5/10

Now, decide with the naïve Bayes rule using Eq. (9):

$P(HoReCa \mid X_{new})$ = P(Purchasing = High | Client = HoReCa) · P(Revisiting = More | Client = HoReCa) · P(Banking = Medium | Client = HoReCa) · P(Client = HoReCa)

$P(HoReCa \mid X_{new}) = \frac{3}{5} \cdot \frac{2}{5} \cdot \frac{1}{5} \cdot \frac{5}{10} = 0.024$

$P(Retail \mid X_{new})$ = P(Purchasing = High | Client = Retail) · P(Revisiting = More | Client = Retail) · P(Banking = Medium | Client = Retail) · P(Client = Retail)

$P(Retail \mid X_{new}) = \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{4}{5} \cdot \frac{5}{10} = 0.064$

Since $P(Retail \mid X_{new}) > P(HoReCa \mid X_{new})$, the label of $X_{new}$ is Retail. In the proposed work, the naïve Bayes classifier is imported from the sklearn library to learn and evaluate the model.
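Because the wholesale attributes are continuous spending values rather than categorical levels like those in Table 3, a Gaussian naïve Bayes model is the natural scikit-learn choice; the sketch below is an assumption about the concrete class used, since the paper only states that the classifier comes from sklearn.

    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB()              # models P(x_i | class) with one Gaussian per class and feature
    nb.fit(X_train, y_train)
    print(nb.predict(X_test[:5]))  # predicted channel for the first five test customers

For purely categorical data such as the toy example, sklearn's CategoricalNB would reproduce the frequency tables above.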

3.3.5 Support vector machines

Support vector machines are one of the most effective algorithms to solve linear problems (Nalepa and Kawulok 2018). Although logistic regression also classifies linear problems, SVMs use the concept of support vectors to perform linear separation. An SVM has a clever way to reduce overfitting and can use many features without requiring too much computation.

1. Mathematical intuition

Consider Fig. 5, which is a hypothetical example of a dataset that is linearly separable, with a decision boundary drawn as a solid line (a plane) and two dotted lines serving as the positive and negative planes. The green stars are considered positive points, while the orange circles are negative points. The equation of the positive plane is given as $w^T x + b = +1$, the equation of the negative plane is given as $w^T x + b = -1$, and the equation of the hypothesis plane is given as $w^T x + b = 0$.

Fig. 5 Representation of a decision boundary in SVM

The positive and negative planes change as the hypothesis plane changes. Here, this is explained by taking points that are linearly separable; however, in the real world, the data can be non-separable. The distance from the
hypothesis plane to the nearest positive point or positive plane is given by $d_+$, and the distance from the hypothesis plane to the nearest negative point or negative plane is given by $d_-$.

Hence, the equation for the positive plane is given as $w^T d_+ + b = +1$ and that of the negative plane is given as $w^T d_- + b = -1$. Now, solving both equations gives

$w^T (d_+ - d_-) = 2$

After dividing both sides by $\|w\|$, the following equation is derived:

$\frac{w^T}{\|w\|} (d_+ - d_-) = \frac{2}{\|w\|}$

Now maximize

$\frac{2}{\|w\|}$

given (w, b), such that

$y_i \cdot (w^T x_i + b) \geq 1$   (10)

If Eq. (10) is valid, then the classification is true; otherwise, it is a misclassification (Sweilam et al. 2010).

For optimization, perform the minimization of $\frac{\|w\|^2}{2}$. Lagrange duality is used to get the dual form. This allows an efficient algorithm to be derived to solve the above optimization. Let us understand Lagrange duality. A problem is given as

$\min_w f(w)$ s.t. $g_i(w) \leq 0,\ i = 1, \ldots, k$ and $h_i(w) = 0,\ i = 1, \ldots, l$

Then, the generalized Lagrangian is given as

$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{l} \beta_i h_i(w)$

where the $\alpha$'s and $\beta$'s are the Lagrange multipliers. The solution to the dual problem is represented as

$d^* = \max_{\alpha, \beta, \alpha_i \geq 0} \min_w L(w, \alpha, \beta)$

According to Lagrange duality, minimize

$L_p(w, b, \alpha_i) = \frac{\|w\|^2}{2} - \sum_{i=1}^{k} \alpha_i \left( y_i \cdot (w^T x_i + b) - 1 \right)$ s.t. $\alpha_i \geq 0$

Now minimize $L_p$ with respect to w and b for fixed $\alpha$. Taking the partial derivatives of $L_p$ with respect to w and b, the following is obtained:

$\frac{\partial L_p}{\partial w} = 0 \rightarrow w = \sum_{i=1}^{n} \alpha_i y_i x_i$

$\frac{\partial L_p}{\partial b} = 0 \rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0$

$\min_w L_p(w, b, \alpha_i) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j) - b \sum_{i=1}^{m} \alpha_i y_i$

$\min_w L_p(w, b, \alpha_i) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$

The following dual optimal problem is then obtained:

$\max_{\alpha, \beta, \alpha_i \geq 0} \min_w L_p(w, b, \alpha_i) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)$

s.t. $\alpha_i \geq 0,\ i = 1, \ldots, k$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$.

This is a quadratic programming problem. Sequential minimal optimization (SMO) is used to solve the dual problem. A global maximum of $\alpha_i$ can always be found. The Lagrange multipliers $\alpha_i$ reconstruct the parameter vector as a weighted combination of training examples as follows:

$w = \sum_{i \in SV} \alpha_i y_i x_i$

To test this with new data z, compute

$w^T z + b = \sum_{i \in SV} \alpha_i y_i (x_i^T z) + b$

and classify z as class 1 if the sum is positive and as class -1 if the sum is negative.

Two more terms, $c_1$ and $\xi_i$, are added to optimize and generalize this function:

$\min \left( \frac{\|w\|^2}{2} + c_1 \sum_{i=1}^{m} \xi_i \right)$

where $\xi_i$ is the value of the error and $c_1$ is the regularization term that controls how many errors the model can tolerate.

This is all about hard-SVM, which means strictly classifying the points into a linear separation. However, when the data are not linearly separable, kernel tricks are used.
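A minimal scikit-learn sketch of the soft-margin SVM described above is given below; the C parameter plays the role of c1, and switching the kernel applies the kernel trick for data that are not linearly separable. The hyperparameter values are illustrative.

    from sklearn.svm import SVC

    svm = SVC(kernel="linear", C=1.0)   # margin softness controlled by C
    svm.fit(X_train, y_train)

    # For non-linearly separable data, a kernel can be used instead:
    svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm_rbf.fit(X_train, y_train)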
3.3.6 Random forest

Random forest is an ensemble technique, which is an aggregation or combination of several base models. The ensemble technique is of two types: bagging and boosting. Bagging is also called bootstrap aggregation. Feeding the data to the base models by row sampling with replacement and predicting the classes is called bootstrapping, and aggregation is the result based on the majority vote of the base models on the test data (Kohavi et al. 2004). Random forest is a bagging technique that uses the decision tree as a base model (Breiman 2001). It applies both feature sampling and row sampling with replacement to feed the data to the base models (Ali et al. 2012). Fig. 6 presents an example of random forest classification.

Fig. 6 Random forest classification

Suppose a training dataset that is classified as 0 or 1 (that is, binary classification) is given to different decision tree models with feature sampling and row sampling with replacement. Then, the results exhibited by the decision trees are given as shown in Fig. 6. When a test dataset is passed through, the results of the decision trees are aggregated using the majority voting method to predict the final class.

When a decision tree is used alone to classify a dataset, it has low bias and high variance when the tree is grown to the maximum depth. Feature sampling and row sampling are used with different decision tree models to reduce variance.
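A sketch of the corresponding scikit-learn model is shown below; bootstrap row sampling and per-split feature sampling are both enabled, and the settings shown are assumptions rather than the paper's exact configuration.

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=100,      # number of bootstrapped decision trees
        max_features="sqrt",   # feature sampling at each split
        bootstrap=True,        # row sampling with replacement
        random_state=0,
    )
    rf.fit(X_train, y_train)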
3.3.7 Stochastic gradient descent classifier

Stochastic gradient descent (SGD) is an efficient technique for linear classification problems under convex loss functions, such as the (linear) support vector machine and logistic regression. SGD is merely an optimization technique and does not correspond to a specific set of machine learning algorithms. Advantages of stochastic gradient descent include its efficiency and ease of implementation (Bottou 2010). In this work, a linear support vector machine is used as the classifier, and a gradient descent algorithm is applied to optimize the results.
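In scikit-learn this combination (a linear SVM loss optimized with SGD) is available directly; the following sketch assumes the standardized training split, since SGD is sensitive to feature scale, and the hyperparameters are illustrative.

    from sklearn.linear_model import SGDClassifier

    sgd = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)  # hinge = linear SVM loss
    sgd.fit(X_train, y_train)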

3.3.8 Artificial neural network

The artificial neural network is an imitation of the human brain. An artificial neural network consists of processing units called neurons. A neuron consists of inputs (dendrites) and an output (synapse by axon) (Agatonovic-Kustrin and Beresford 2000).

Artificial neural networks consist of input, output, and hidden layers. Each layer consists of nodes. A node is a simple neuron with two functions:

1. Summation of weights and input values and
2. Activation functions.

When an input is given to an ANN, random weights are associated with the inputs, and a linear combination (summation) is calculated. Then, it is passed to the activation function for the output to the next layer (Bala and Kumar 2017). A multilayer neural network can solve complex problems. The architecture of a neural network with two hidden layers is shown in Fig. 7.

Fig. 7 Architecture of a neural network

The dense architecture (Fig. 7) is designed for the two best input features as an input layer with two nodes. Two hidden layers, with four and two nodes, have been used, and an output layer with a single node is present, as there are two class labels.

For the single unit of the neuron, the summation is given as

$\sum_{i=0}^{n} w_i x_i$

where $w_0$ is the bias and $x_0 = 1$ is a fixed input.
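One way to realize the 2-4-2-1 architecture described above is scikit-learn's MLPClassifier; the paper does not name the library used for its ANN, so this sketch is only an assumption that mirrors the stated hidden-layer sizes (the output layer is created implicitly for the two class labels).

    from sklearn.neural_network import MLPClassifier

    ann = MLPClassifier(hidden_layer_sizes=(4, 2),   # two hidden layers with 4 and 2 nodes
                        activation="logistic",        # sigmoid activation
                        max_iter=2000,
                        random_state=0)
    ann.fit(X_train, y_train)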
After the linear combination of weights and inputs, the result goes through a nonlinear activation function, such as a sigmoid function, ReLU function, or softmax function. A perceptron is trained with a gradient descent algorithm.

A multilayer neural network is trained with the backpropagation algorithm. Let $x_1, x_2, x_3, \ldots, x_n$ be the input and $y_1, y_2, y_3, \ldots, y_n$ be the output values. The predicted values corresponding to each input value are $\hat{y}_1, \hat{y}_2, \hat{y}_3, \ldots, \hat{y}_n$. Weight initialization is done with small random numbers. For one output neuron, the error is given as

$E = \frac{1}{2} (y - \hat{y})^2$   (11)

For each node j, the output $\hat{o}_j$ is defined as

$\hat{o}_j = \varphi(net_j) = \varphi\left( \sum_{k=1}^{n} w_{kj} \hat{o}_k \right)$

The $net_j$ is the weighted sum of the outputs $\hat{o}_k$ of the previous n neurons. Finding the derivative of Eq. (11), that is, the error, with respect to a weight:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}$

$\frac{\partial E}{\partial w_{ij}} = \left( \sum_l \frac{\partial E}{\partial o_l} \frac{\partial o_l}{\partial net_l} w_{jl} \right) \varphi(net_j)(1 - \varphi(net_j)) o_i = \delta_j o_i$

where

$\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} = (o_j - y_j) o_j (1 - o_j)$

if j is an output neuron, and

$\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \left( \sum_l \delta_l w_{jl} \right) o_j (1 - o_j)$

if j is an inner neuron. To update the weight $w_{ij}$ using gradient descent, one must choose a learning rate $\eta$:

$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$

Hence, weight updating is done as follows:

$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, with $\Delta w_{ij} = \eta \delta_j x_{ij}$

3.3.9 AdaBoost

AdaBoost is an ensemble technique and a boosting algorithm. It combines weak learners or classifiers to improve their performance (Schapire 2013). Each learner is trained with a simple set of training samples. Each sample has a weight, and the sample weights are adjusted iteratively. AdaBoost iteratively trains each learner and calculates a weight for each one, and each weight represents the robustness of each weak learner. Here, a decision tree is used as the base learner (Freund and Schapire 1999). The AdaBoost algorithm has three main steps:

• Sampling: In this step, some samples $D_t$ are selected from the training set, where $D_t$ is the set of samples in iteration t.
• Training: In this step, different classifiers are trained using $D_t$, and the error rate $\epsilon_i$ for each classifier is calculated.
• Combination: Here, all trained models are combined.

1. Mathematical intuition

Consider a dataset (Table 7) comprising five samples, where F1, F2, and F3 are features and Y is the output.

Table 7 Dataset for AdaBoost

F1 | F2 | F3 | Y
X11 | X12 | X13 | Y+
X21 | X22 | X23 | Y-
X31 | X32 | X33 | Y-
X41 | X42 | X43 | Y+
X51 | X52 | X53 | Y+

Step 1: Assign the same weight to each record. This is called the sample weight and is given by

$w = \frac{1}{n}$

where n is the number of records. Hence, each record is assigned a weight of 1/5.

In this algorithm, the tree is not grown as in a random forest. Instead, a decision tree of depth one (called a stump) is constructed for each feature. From these stumps, the first base learner model is selected using the entropy or Gini index. The stump with the least entropy is selected as the first base learner model.
Step 2: After the first base learner model, suppose it correctly classifies three samples, while two are misclassified. Then, the total error is calculated by summing all the misclassified sample weights. Here, two samples are misclassified, each having a weight of 1/5. Then, the total error is

$TE = \frac{2}{5} = 0.4$

Step 3: Calculate the performance of the stump. The formula for this is

$Performance\ of\ stump = \frac{1}{2} \log_e \left( \frac{1 - TE}{TE} \right)$   (12)

Putting the values into Eq. (12) yields

$Performance\ of\ stump = \frac{1}{2} \log_e \left( \frac{1 - 0.4}{0.4} \right) = 0.20$

Step 4: Update the weights of the misclassified points to pass the samples to the next sequential base learner:

$New\ weight = old\ weight \times e^{Performance\ of\ stump}$   (13)

Putting the values into Eq. (13) yields

$New\ weight = \frac{1}{5} \times e^{0.20} = 0.24$

This updated weight is greater than 1/5. Decrease the weight of the correctly classified points as follows:

$New\ weight = old\ weight \times e^{-Performance\ of\ stump}$   (14)

Putting the values into Eq. (14) yields

$New\ weight = \frac{1}{5} \times e^{-0.20} = 0.16$

Hence, the updated weight is 0.16 for the three correctly classified points and 0.24 for the two misclassified points. The sample weights should have a sum of 1, so the updated weights have to be normalized. The next step is to sum all the updated weights as follows:

$0.16 \times 3 + 0.24 \times 2 = 0.96$

Divide all the weights by 0.96 to get the normalized weights. The weight of each correctly classified point is 0.16/0.96 = 0.167, and that of each misclassified point is 0.24/0.96 = 0.25.

Based on the normalized weights, the weights are now divided into buckets. A new dataset is created for the next sequential learner by taking random bucket values and using the corresponding samples in the new dataset. Based on the new dataset, new stumps are created; this process is repeated until the last decision tree is created.

Suppose three decision trees are formed. In this case, test data are passed, and the classifications are obtained as Y+, Y-, and Y+. Then, using the majority voting method, the output is Y+.
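The whole procedure above (stumps, weight updates and weighted voting) is what scikit-learn's AdaBoostClassifier implements; its default base learner is already a depth-one decision stump, so a minimal sketch for this work could look as follows (the number of estimators is an assumption).

    from sklearn.ensemble import AdaBoostClassifier

    ada = AdaBoostClassifier(n_estimators=50, random_state=0)  # default base learner: depth-1 decision stump
    ada.fit(X_train, y_train)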
3.3.10 XgBoost

XgBoost is an ensemble technique and a boosting algorithm. It uses the decision tree as a base model. The main advantages of XgBoost are that it is scalable and that it uses parallel and distributed computing to offer memory-efficient usage (Chen and Guestrin 2016). The steps in a gradient boosting algorithm are:

1. Take the first output $\hat{y}$ as a constant based on a base classifier.
2. Compute the errors or residual ($R_1$) using any loss function.
3. Construct the first decision tree with the features and the residual ($R_1$) as the target variable. After getting the predictions from this decision tree, consider ($R_2$) as the new residual, such that $R_2 < R_1$.
4. To compute the final result, combine both the base classifier and the decision trees sequentially. However, the direct combination of a decision tree with the base model will overfit the whole classifier. Hence, a learning rate is introduced in the combination as follows:

$y(x) = O_{base}(x) + \alpha O_{DT1}(x)$

where $\alpha$ is the learning rate.

5. If y(x) is significantly different from the actual output, then add another tree sequentially based on the input features and residual $R_2$ as the target variable.

Suppose n decision trees are required. Then, the final output is given as

$h(x) = h_0(x) + \alpha_1 h_1(x) + \alpha_2 h_2(x) + \ldots + \alpha_n h_n(x)$

or

$h(x) = h_0(x) + \sum_{i=1}^{n} \alpha_i h_i(x)$
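A minimal sketch with the xgboost package's scikit-learn wrapper is given below; it assumes the channel labels have been re-encoded as 0/1 (the hypothetical array y_train01), and the hyperparameters are illustrative rather than the paper's settings.

    from xgboost import XGBClassifier

    xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
    xgb.fit(X_train, y_train01)   # y_train01: channel labels re-coded to 0/1 (placeholder name)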

3.3.11 Dummy classifier

A dummy classifier is a classifier that makes predictions using a simple rule. It is used as a baseline to check the accuracy of another classifier. It is not used to solve real problems.
A dummy classifier does a simple sanity check for the supervised learning models by comparing them with simple rules. A few strategies for implementing the dummy classifier are:

1. Stratified: generates random predictions by respecting the training set class distribution.
2. Most frequent: always predicts the most frequent label in the training set.
3. Prior: always predicts the class that maximizes the class prior.
4. Uniform: generates predictions uniformly at random.
5. Constant: always predicts a constant label provided by the user.

In this paper, the stratified method is used.
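The baseline used here corresponds to scikit-learn's DummyClassifier with the stratified strategy; a minimal sketch:

    from sklearn.dummy import DummyClassifier

    dummy = DummyClassifier(strategy="stratified", random_state=0)  # random guesses matching class frequencies
    dummy.fit(X_train, y_train)
    print(dummy.score(X_test, y_test))  # baseline accuracy to compare the real models against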

3.3.12 Stacking

Stacking is an ensemble technique that combines the predictions of many machine learning models on the same dataset (Džeroski and Ženko 2004). Stacking uses a two-level architecture:

1. Base level (0-level): the base-level architecture consists of the base machine learning models whose predictions are to be combined.
2. Meta level (1-level): the meta-level architecture consists of the machine learning model that learns how to best combine the predictions of the base models.

Base models are trained on the training dataset, while the meta model is trained on the predictions of the base models. Hence, the output of the base models works as the input for the meta model.

In this work, stacking is done to produce three new classifiers. In all algorithms, KNN is used as the meta model, while the base models are changed. Here, all models are discussed separately:

3.3.12.1 SvmAda. This classifier consists of two base classifiers: a linear support vector machine and AdaBoost. The outputs of the base models are then given to the KNN to predict the final output. The model consists of the two base models SVM and AdaBoost; hence, it is called SvmAda. The architecture is shown in Fig. 8.

Fig. 8 Architecture of SvmAda

3.3.12.2 RfAda. A random forest classifier and AdaBoost work as the base models, and KNN is used as the meta model. Due to the base models, the model is named RfAda (random forest and AdaBoost). The architecture of the model is given in Fig. 9.

Fig. 9 Architecture of RfAda

3.3.12.3 KnnSgd. The base classifiers of this model are KNN and SGD, which is why it is named KnnSgd. The meta model is KNN. The architecture of KnnSgd is shown in Fig. 10.

Fig. 10 Architecture of KnnSgd
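scikit-learn's StackingClassifier expresses this two-level design directly. The sketch below builds the KnnSgd hybrid (KNN and SGD as base models, KNN as the meta model); SvmAda and RfAda follow the same pattern with their own base estimators. All hyperparameter values shown are assumptions.

    from sklearn.ensemble import StackingClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import SGDClassifier

    knn_sgd = StackingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier(n_neighbors=3)),                          # base model 1
            ("sgd", SGDClassifier(loss="hinge", max_iter=1000, random_state=0)),   # base model 2
        ],
        final_estimator=KNeighborsClassifier(n_neighbors=3),                       # meta model
    )
    knn_sgd.fit(X_train, y_train)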
3.4 Calculation of accuracy

Accuracy is calculated based on the confusion matrix (Table 8). The confusion matrix is a table that is used to calculate the accuracy, represented as:

Table 8 Confusion matrix

Predicted class | Actual Class1 | Actual Class2
Class1 | TP | FP
Class2 | FN | TN

$Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$   (15)

where TP: true positive, TN: true negative, FN: false negative, FP: false positive.

The ROC curve is drawn between the true positive rate (TPR) and the false positive rate (FPR) to illustrate the ability of a binary classifier:

$TPR = \frac{TP}{TP + FN}$

$FPR = \frac{FP}{FP + TN}$

ROC is also known as the receiver operating characteristic curve because it compares two operating characteristics (TPR and FPR) (Lavrač et al. 2004).
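These quantities can be computed with scikit-learn's metrics module; the sketch below assumes a fitted classifier clf and the held-out test split, and takes the positive class from clf.classes_.

    from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc

    y_pred = clf.predict(X_test)
    print(confusion_matrix(y_test, y_pred))   # rows = actual class, columns = predicted class
    print(accuracy_score(y_test, y_pred))     # (TP + TN) / (TP + TN + FP + FN), as in Eq. (15)

    y_score = clf.predict_proba(X_test)[:, 1]              # score of the second class in clf.classes_
    fpr, tpr, _ = roc_curve(y_test, y_score, pos_label=clf.classes_[1])
    print(auc(fpr, tpr))                                    # area under the ROC curve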
4 Result and analysis

Each algorithm has been implemented using common steps. Feature selection is done using the chi-square test. The two best features, Grocery and Detergents paper, were obtained. The visualization of the data across the two best features is shown in Fig. 11.

Fig. 11 Visualization of data

The data is divided into two parts:

1. Training set (70%)
2. Test set (30%)

The analysis results for each algorithm are presented below.
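A sketch of this split, together with the cross-validation used for model evaluation, is shown below (the random_state value and the use of stratification are assumptions; the paper only states the 70/30 proportion).

    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.30, random_state=0, stratify=y)

    # Cross-validated accuracy of one candidate model on the training portion
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
    print(scores.mean())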
4.1 Logistic regression classifier

A logistic regression classifier classifies data based on the sigmoid function. Accuracy is calculated based on the selected features. When evaluating the accuracy with the two best features, the obtained result is 87.88%. The number of selected features is then increased one by one to improve the performance of the model. When selecting all the features, the highest accuracy of this model is 89.39%. The confusion matrix is shown in Fig. 12.

Fig. 12 Confusion matrix of logistic regression

Accuracy is calculated using Eq. (15):
$Accuracy = \frac{TP + TN}{TP + FN + FP + TN} = \frac{83 + 35}{83 + 11 + 3 + 35} = \frac{118}{132} = 0.8939$

Hence, the accuracy obtained is 89.39%. The confusion matrix can also be used to get the precision, recall, and F1 score. A low false positive value can be seen in the confusion matrix of the logistic regression model.

The ROC curve regarding this accuracy is shown in Fig. 13. The ROC curve area is 0.97. The ROC curve represents the relation between the false positive and true positive values.

Fig. 13 ROC curve for logistic regression

4.2 Decision tree

Initially, Grocery and Detergents paper were selected as the features for making decisions. Then, adding Milk and Frozen increased the performance of the model. Finally, all features were added, and the algorithm showed an accuracy of 86.36%. Accuracy was calculated using a confusion matrix (see Fig. 14).

Fig. 14 Confusion matrix of decision tree

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{79 + 35}{79 + 11 + 7 + 35} = \frac{114}{132} = 0.8636$

Hence, the accuracy is 86.36%.

The ROC curve regarding the accuracy is shown in Fig. 15. The ROC curve reflects an area of 0.84. The relation between the false positive rate and true positive rate presents a steep curve, indicating lower accuracy.

Fig. 15 ROC curve for decision tree

4.3 KNN

To prepare the KNN model, the dataset is divided into a training set (70%) and a test set (30%). The accuracy of the model, which was determined by considering two features (Grocery, Detergents paper), was 87.88%. This was calculated using the confusion matrix shown in Fig. 16.

Fig. 16 Confusion matrix for KNN

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{79 + 37}{79 + 9 + 7 + 37} = \frac{116}{132} = 0.8788$

Hence, the accuracy is 87.88%.

The model was trained again by considering all features. The evaluation shows a correct classification percentage of 91.67%. A confusion matrix was drawn to calculate the accuracy of the model (see Fig. 17).

Fig. 17 Confusion matrix for KNN
The accuracy is calculated using Eq. (15):

$Accuracy = \frac{84 + 37}{84 + 9 + 2 + 37} = \frac{121}{132} = 0.9167$

Hence, the accuracy is 91.67%.

The ROC curve for the maximum accuracy is shown in Fig. 18. The relation between the false positive and true positive values shows a start at 0.68 TPR; after 0.4 FPR, both become constant.

Fig. 18 ROC curve for KNN

4.4 Naïve Bayes

The naïve Bayes algorithm performed well when all the features were used to classify the customer data. The overall accuracy achieved is calculated using the confusion matrix shown in Fig. 19.

Fig. 19 Confusion matrix for Naïve Bayes

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{82 + 35}{82 + 11 + 4 + 35} = \frac{117}{132} = 0.8864$

Hence, the accuracy is 88.64%.

The algorithm showed no change in performance when the Region feature was dropped; the accuracy was still 88.64%. The ROC curve is shown in Fig. 20. The relation between the false positive rate and true positive rate shows a start at about 0.62 TPR; after about 0.7 FPR, both become constant.

Fig. 20 ROC curve for Naïve Bayes
4.5 Support vector machine

First, the model was trained using a support vector machine considering the two best features. The accuracy achieved was 88.64%. Further features were added according to their importance. The model's performance increased when all the features were considered when classifying the data. The confusion matrix is shown in Fig. 21.

Fig. 21 Confusion matrix for SVM

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{81 + 37}{81 + 9 + 5 + 37} = \frac{118}{132} = 0.8939$

So, the accuracy is 89.39%.

The ROC curve for the SVM is shown in Fig. 22. The relation between the false positive and true positive values reflects an area of 0.97.

Fig. 22 ROC curve for SVM

4.6 Random forest

The random forest classifier is an ensemble learning method, which was also trained to classify the customer data. First, the model was trained using the two best features; the accuracy achieved was 88.64%. After adding the Milk and Fresh features to the best two features in the classification algorithm, the results increased significantly. A confusion matrix was drawn to calculate the accuracy (see Fig. 23).

Fig. 23 Confusion matrix for the random forest classifier

The accuracy is calculated using Eq. (15):
$Accuracy = \frac{82 + 39}{82 + 7 + 4 + 39} = \frac{121}{132} = 0.9167$

So, the accuracy is 91.67%.

The ROC curve for the random forest classifier is presented in Fig. 24. The area of the ROC-AUC curve is 0.97.

Fig. 24 ROC curve for the random forest classifier

4.7 Stochastic gradient descent

Stochastic gradient descent uses an iterative approach to classify data. The model was trained by considering all features. The accuracy of the algorithm is 89.39%. Accuracy is calculated using the confusion matrix shown in Fig. 25.

Fig. 25 Confusion matrix for SGD

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{83 + 35}{83 + 11 + 3 + 35} = \frac{118}{132} = 0.8939$

So, the accuracy obtained is 89.39%.

The ROC curve is shown in Fig. 26. This curve represents an area of 0.86.

Fig. 26 ROC curve for SGD
4.8 Artificial neural network

An ANN was trained to classify the customer data by considering the two best features (Grocery and Detergents paper). The model was evaluated, and an accuracy of 65.15% was achieved. The confusion matrix drawn for calculating the accuracy is shown in Fig. 27.

Fig. 27 Confusion matrix for ANN

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{86 + 0}{86 + 46 + 0 + 0} = \frac{86}{132} = 0.6515$

So, the accuracy obtained is 65.15%.

The ROC curve for the ANN is shown in Fig. 28. The ROC-AUC curve reflects an area of 0.84.

Fig. 28 ROC curve for ANN

4.9 AdaBoost

AdaBoost is an ensemble learning method that classifies the data based on an iterative approach. It is used to increase the efficiency of the classifier. Initially, the model was trained using two features (Grocery and Detergents paper). The model showed an accuracy of 90.91%. After adding the Milk feature, there was an increase in the efficiency of the model, which is shown in Fig. 29.

Fig. 29 Confusion matrix for AdaBoost

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{81 + 39}{81 + 7 + 5 + 39} = \frac{120}{132} = 0.9091$

Hence, the accuracy achieved is 90.91%.

The ROC curve for AdaBoost is drawn in Fig. 30. The ROC-AUC curve reflects an area of 0.95.

Fig. 30 ROC curve for AdaBoost

4.10 XgBoost

XgBoost is an ensemble technique that uses a boosting method to classify data. The model was trained using the XgBoost method. Then, the model was evaluated by providing all the features. The performance of the algorithm was evaluated using the confusion matrix shown in Fig. 31.

Fig. 31 Confusion matrix for XgBoost

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{82 + 39}{82 + 7 + 4 + 39} = \frac{121}{132} = 0.9167$

Hence, the accuracy is 91.67%.

The ROC curve for XgBoost is drawn in Fig. 32. The ROC-AUC curve represents an area of 0.97.

Fig. 32 ROC curve for XgBoost
4.11 Dummy classifier

The dummy classifier classifies data using simple rules. The model was trained using the dummy classifier. The two most important features (Grocery and Detergents paper) were used for classification purposes. Variations in accuracy were obtained, so the average of five consecutive outcomes was considered. The outcomes were 43.18%, 53.03%, 56.82%, 59.85%, and 56.82%. A confusion matrix was drawn to calculate the accuracy of the model (see Fig. 33).

Fig. 33 Confusion matrix for the dummy classifier

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{62 + 17}{62 + 29 + 24 + 17} = \frac{79}{132} = 0.5984$

Hence, the accuracy is 59.84%. Similarly, the other accuracies are calculated. The average accuracy obtained for the model is 53.94%. The ROC curve is shown in Fig. 34. The area value of this ROC-AUC curve is 0.49. The false positive rate is directly proportional to the true positive rate.

Fig. 34 ROC curve for the dummy classifier

4.12 Stacking

4.12.1 SvmAda

Stacking is an ensemble method used to create the hybrid algorithm SvmAda. SvmAda is a hybrid algorithm using SVM and AdaBoost as the base algorithms. Initially, the model was trained and evaluated using two features (Grocery and Detergents paper). By considering these features, the accuracy obtained was 87.88%. Further, the model was trained (and improved) by adding the Milk and Fresh features. The accuracy was calculated using the confusion matrix shown in Fig. 35.

Fig. 35 Confusion matrix of SvmAda

The accuracy is calculated using Eq. (15):

$Accuracy = \frac{79 + 40}{79 + 6 + 7 + 40} = \frac{119}{132} = 0.9015$

Hence, the accuracy is 90.15%.

4.12.2 RfAda

RfAda is a hybrid algorithm using a random forest classifier and AdaBoost as the base algorithms. Considering the two best features (Grocery and Detergents paper), the accuracy obtained was 90.90%.

Along with the two best features, another three features (Milk, Fresh, and Frozen) were added to maximize the accuracy. A confusion matrix was drawn to calculate the accuracy (see Fig. 36).

Fig. 36 Confusion matrix for RfAda
Customer purchasing behavior prediction using machine learning classification techniques

Table 9  Analysis of accuracy based on features

Feature sets (added cumulatively, left to right): (1) grocery, detergents paper (best two features); (2) + milk; (3) + fresh; (4) + frozen; (5) + delicatessen; (6) + region. All values are accuracies in %.

Algorithm                      (1)      (2)      (3)      (4)      (5)      (6)
Logistic regression            87.88    87.88    88.64    87.88    87.88    89.39
Decision tree                  82.58    84.09    85.61    85.61    86.36    86.36
KNN                            87.88    87.12    90.15    90.91    90.15    91.67
Naïve Bayes                    86.36    84.85    84.85    87.88    88.64    88.64
Support vector machine         88.64    88.64    87.88    88.64    88.64    89.39
Random forest                  88.64    88.64    91.67    90.91    89.39    90.15
Stochastic gradient descent    88.64    56.06    83.33    87.88    89.39    89.39
Artificial neural network      68.45    68.45    68.45    68.45    68.45    68.45
AdaBoost                       90.15    90.91    90.15    90.15    90.15    90.15
XgBoost                        89.39    87.88    90.91    90.91    90.91    91.67
Dummy classifier               60.61    53.03    57.58    54.55    58.33    50.00
SvmAda                         87.88    88.64    90.15    89.39    88.64    89.39
RfAda                          89.39    88.64    88.64    90.91    90.91    89.39
KnnSgd                         87.88    72.73    89.39    90.91    89.39    92.42
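
The per-feature-set accuracies of Table 9 could be reproduced with a loop of the following shape — a sketch only, reusing the assumed X_train/X_test split from the earlier sketch; the column names follow the UCI Wholesale customers file and, like the classifier defaults, are assumptions.

# Sketch: evaluate each classifier on the cumulative feature subsets of Table 9.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

feature_order = ["Grocery", "Detergents_Paper", "Milk", "Fresh",
                 "Frozen", "Delicassen", "Region"]
# Start from the two best features and add one more column at a time.
subsets = [feature_order[:k] for k in range(2, len(feature_order) + 1)]

classifiers = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    # ... the remaining models of Table 9 are added here in the same way
}

for name, clf in classifiers.items():
    for cols in subsets:
        clf.fit(X_train[cols], y_train)
        acc = accuracy_score(y_test, clf.predict(X_test[cols]))
        print(f"{name:20s} {len(cols)} features: {acc:.4f}")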

The accuracy is calculated using Eq. (15):

Accuracy = (82 + 38) / (82 + 8 + 4 + 38) = 120 / 132 = 0.9090

Hence, the accuracy is 90.90%. The ROC for RfAda is presented in Fig. 37. The ROC-AUC curve area is 0.95, which indicates good accuracy and few misclassifications.

4.12.3 KnnSgd

KnnSgd is a hybrid algorithm that uses SGD and the k-nearest neighbors classifier as its base algorithms. It yielded an accuracy of 92.42%. The confusion matrix drawn to calculate the accuracy of the model is shown in Fig. 38.
The accuracy is calculated using Eq. (15):

Accuracy = (84 + 38) / (84 + 8 + 2 + 38) = 122 / 132 = 0.9242

Hence, the accuracy is 92.42%, the maximum accuracy obtained by any of the models. The ROC curve for KnnSgd is presented in Fig. 39. The ROC-AUC curve area is 0.96. The false positive rate and the true positive rate are closely related, and both values are initially inclined towards zero.

5 Conclusion and future scope

E-commerce is an important domain for data mining. This paper describes the most important supervised machine learning techniques and obtains better accuracy on this dataset than previous work.
Some important facts observed from this experiment are:

1 A decision tree can repeat the same feature in its sub-nodes.
2 The number of hidden layers and the number of nodes in each hidden layer vary with the problem.
3 Model accuracy does not necessarily follow from feature selection (i.e., the best features that you have selected may or may not give the best accuracy).

Apart from these observations, the highest accuracy is observed for the KnnSgd algorithm. Also, when only the two best features are considered, the highest accuracy is obtained by the AdaBoost algorithm (90.15%). A detailed description of the results is shown in Table 9.


Machine learning techniques are well suited to small datasets, whereas deep learning, with its much larger number of parameters, should be used for large datasets. The supervised learning algorithms used here are limited to static datasets; if the dataset is dynamic, the focus should be on time series or deep learning models. Customer analysis is important for all companies. In the future, meeting the demand of targeted customers will require selling products to the customers who actually want them; hence, the analysis of the customer is important.
The highest accuracy obtained is 92.42%. Varying the data may yield further gains in predicting customer behavior. Although the analysis covers most of the important machine learning algorithms, a combination of new hybrid algorithms and a new dataset with more instances may improve a model's accuracy. Due to advancements in deep learning, customer purchasing behavior can also be analyzed from video data, and a more innovative solution could be deployed in smart malls to predict or suggest which items customers will purchase according to their needs.
In the future, explainable AI can be used to make the models more transparent and to expose the logic behind the segregation of the customers, turning the models into white boxes. Explaining the important features for each classification is a new challenge that needs to be resolved.

References

Adebola Orogun BO (2019) Predicting consumer behaviour in digital market: a machine learning approach. Int J Innov Res Sci Eng Technol 8(8):8391–8402
Adeniyi D, Wei Z, Yongquan Y (2016) Automated web usage data mining and recommendation system using k-nearest neighbor (KNN) classification method. Appl Comput Inform 12(1):90–108
Agatonovic-Kustrin S, Beresford R (2000) Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 22(5):717–727
Ali J, Khan R, Ahmad N, Maqsood I (2012) Random forests and decision trees. Int J Comp Sci 9(5). http://ijcsi.org/papers/IJCSI-9-5-3-272-278.pdf
Alloghani M, Al-Jumeily D, Baker T, Hussain A, Mustafina J, Aljaaf AJ (2018) Applications of machine learning techniques for software engineering learning and early prediction of students' performance. In: Communications in computer and information science. Springer, Singapore, pp 246–258
Amin A, Shah B, Khattak AM, Baker T, ur Rahman Durani H, Anwar S (2018) Just-in-time customer churn prediction: with and without data transformation. In: 2018 IEEE congress on evolutionary computation (CEC). IEEE
Bala R, Kumar D (2017) Classification using ANN: a review. Int J Comput Intell Res 13(7):1811–1820
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010. Physica-Verlag HD, pp 177–186
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cardoso (2014) UCI Machine Learning Repository
Cardoso MGMS (2012) Logical discriminant models. In: Quantitative modelling in marketing and management. https://doi.org/10.1142/9789814407724_0008
Charanasomboon T, Viyanon W (2019) A comparative study of repeat buyer prediction. In: Proceedings of the 2019 2nd international conference on information science and systems. ACM
Chaubey G, Bisen D, Arjaria S, Yadav V (2020) Thyroid disease prediction using machine learning approaches. Natl Acad Sci Lett 44(3):233–238
Chen T, Guestrin C (2016) XGBoost. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM
Das TK (2015) A customer classification prediction model based on machine learning techniques. In: 2015 International conference on applied and theoretical computing and communication technology (iCATccT). IEEE
Dawood EAE, Elfakhrany E, Maghraby FA (2019) Improve profiling bank customer's behavior using machine learning. IEEE Access 7:109320–109327
Do QH, Trang TV (2020) An approach based on machine learning techniques for forecasting Vietnamese consumers' purchase behaviour. Decis Sci Lett, pp 313–322. http://www.growingscience.com/dsl/Vol9/dsl_2020_16.pdf
Dreiseitl S, Ohno-Machado L (2002) Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 35(5–6):352–359
Džeroski S, Ženko B (2004) Is combining classifiers with stacking better than selecting the best one? Mach Learn 54(3):255–273
Freund Y, Schapire RE (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(5):771–780
Gupta G, Aggarwal H (2012) Improving customer relationship management using data mining. Int J Mach Learn Comput, pp 874–877. http://www.ijmlc.org/papers/256-L40070.pdf
Hehn TM, Kooij JFP, Hamprecht FA (2019) End-to-end learning of decision trees and forests. Int J Comput Vision 128(4):997–1011
Kachamas P, Akkaradamrongrat S, Sinthupinyo S, Chandrachai A (2019) Application of artificial intelligent in the prediction of consumer behavior from Facebook posts analysis. Int J Mach Learn Comput 9(1):91–97
Kaviani P, Dhotre MS (2017) Short survey on naive Bayes algorithm. IJAERD
Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1/2):83–113
Lavrač N, Cestnik B, Gamberger D, Flach P (2004) Decision support through subgroup discovery: three case studies and the lessons learned. Mach Learn 57(1/2):115–143
Liu W, Wang J, Sangaiah AK, Yin J (2018) Dynamic metric embedding model for point-of-interest prediction. Futur Gener Comput Syst 83:183–192
Momin S, Bohra T, Raut P (2019) Prediction of customer churn using machine learning. In: EAI international conference on big data innovation for sustainable cognitive computing. Springer International Publishing, pp 203–212
Nalepa J, Kawulok M (2018) Selecting training sets for support vector machines: a review. Artif Intell Rev 52(2):857–900
Raghuwanshi BS, Shukla S (2018) Class-specific extreme learning machine for handling binary class imbalance problem. Neural Netw 105:206–217
Rokach L, Maimon O (2005) Decision trees. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_9


Sánchez-Franco MJ, Navarro-García A, Rondán-Cataluña FJ (2019) A Naive Bayes strategy for classifying customer satisfaction: a study based on online reviews of hospitality services. J Bus Res 101:499–506
Sangaiah AK, Medhane DV, Han T, Hossain MS, Muhammad G (2019) Enforcing position-based confidentiality with machine learning paradigm through mobile edge computing in real-time industrial informatics. IEEE Trans Industr Inf 15(7):4189–4196
Santharam A, Krishnan SB (2018) Survey on customer churn prediction techniques. Int Res J Eng Tech 5(11):131–137
Schapire RE (2013) Explaining AdaBoost. In: Empirical inference. Springer, Berlin Heidelberg, pp 37–52
Sweilam NH, Tharwat A, Moniem NA (2010) Support vector machine for diagnosis cancer disease: a comparative study. Egypt Inform J 11(2):81–92
Ullah I, Raza B, Malik AK, Imran M, Islam SU, Kim SW (2019) A churn prediction model using random forest: analysis of machine learning techniques for churn prediction and factor identification in telecom sector. IEEE Access 7:60134–60149
Vafeiadis T, Diamantaras K, Sarigiannidis G, Chatzisavvas K (2015) A comparison of machine learning techniques for customer churn prediction. Simul Model Pract Theory 55:1–9
Zhang Z (2016) Introduction to machine learning: k-nearest neighbors. Ann Transl Med 4(11):218
Zhao B, Takasu A, Yahyapour R, Fu X (2019) Loyal consumers or one-time deal hunters: repeat buyer prediction for e-commerce. In: 2019 International conference on data mining workshops (ICDMW). IEEE

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
