Customer Purchasing Behavior Prediction Using Machine Learning Classification Techniques
ORIGINAL RESEARCH
Abstract
Many sales and service-providing companies need to reach the right customers when launching new products, services, and updated versions of existing products. While doing so, they need to target their existing customers, whose behavior gives companies information about how to sell products. This paper presents a comparative study of different machine learning techniques applied to the problem of customer purchasing behavior prediction. Experiments are done using supervised classification machine learning techniques such as logistic regression, decision tree, k-nearest neighbors (KNN), Naïve Bayes, SVM, random forest, stochastic gradient descent (SGD), ANN, AdaBoost, XgBoost, and a dummy classifier, as well as some hybrid algorithms that use stacking, such as SvmAda, RfAda, and KnnSgd. Models are evaluated using the cross-validation technique. Furthermore, the confusion matrix and ROC curve are used to evaluate the accuracy of each model. Finally, the best classifier is a hybrid classifier using the ensemble stacking technique (KnnSgd), with an accuracy of 92.42%. KnnSgd gives the highest accuracy when the maximum number of features is used, because the errors of the KNN and SGD base learners are reduced by the KNN meta learner at the end.
Keywords Machine learning · Supervised classification algorithms · Customer purchasing behavior prediction ·
E-commerce
The results indicated that the decision tree outperformed the other methods, achieving an accuracy of 91.67%.
Furthermore, Ullah et al. (2019) proposed a prediction model to identify churn customers, along with the factors underlying the churning of customers. Different classification methods were used to classify churn customers, among which a random forest (RF) algorithm performed well, with 88% accuracy. Amin et al. (2018) used a just-in-time customer churn prediction (JIT-CCP) model to compare the effect of state-of-the-art data transformations. They conducted three experiments to calculate the performance of the JIT-CCP model: JIT-CCP without data transformation achieved 49.9% accuracy, JIT-CCP with the log method achieved 49.1% accuracy, and JIT-CCP with the rank method achieved 59.3% accuracy.
Alloghani et al. (2018) discussed the applications of machine learning in software engineering learning and in predicting the performance of students. They used seven techniques and two datasets. The neural network gave the best accuracy for the first dataset, and random forest gave the best accuracy for the second dataset. Likewise, Liu et al. (2018) addressed the problem of data sparsity and studied temporal factors systematically. In this work, POI prediction was done by combining temporal and spatial factors in a Euclidean space. On this basis, a DME-TS model was proposed. It was based on metric embedding and constructed a model that was more flexible for various contextual factors. Finally, the authors compared performance with other models (e.g., BPR-MF, PRME-G, WMF, WWO, and ST-RNN) and observed improved results on benchmark datasets such as Foursquare and Gowalla.
After that, a machine learning-based method was proposed for conserving the position confidentiality of roaming PBS users. The authors identified user position by combining decision tree and K-nearest neighbor learning techniques. In this method, the hidden Markov model was applied to estimate users' destinations, including their position track sequence. Finally, this work was evaluated to provide 90% position confidentiality in PBS (Sangaiah et al. 2019).
In 2018, researchers proposed a class-specific extreme learning machine (CS-ELM) to handle the binary class imbalance problem using class-specific regularization parameters. This method yielded a lower computational overhead and higher accuracy than a previous weighted extreme learning machine model on 38 different datasets taken from the KEEL dataset repository (Raghuwanshi and Shukla 2018).
Based on previous works, this paper describes every algorithm in depth, along with the geometrical and mathematical intuition of the machine learning classification techniques. In this work, customer behavior has been analyzed as either Retail or hotel, restaurant, and cafe (HoReCa).

3 Proposed procedure

The proposed work is done using the wholesale customers' data from the UCI (Irvine) machine learning data repository. The dataset has eight attributes, including Channel (HoReCa or Retail customer) as the target class. The total number of instances is 440 (298 HoReCa customers and 142 Retail customers) (Cardoso 2014). Various classification methods are used to predict the class of the customers. The whole procedure of analysis is divided into the following steps, as shown in Fig. 1:

1 Gathering the data
2 Preprocessing the data
3 Feature selection
4 Fitting the model
5 Measuring the model's accuracy

Fig. 1 Procedure of analysis

3.1 Data preprocessing

Preprocessing is a step in data science used to clean, transform, and reduce the data to better fit the model. There are various methods of data preprocessing.
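As a concrete starting point for steps 1 and 2 of the procedure above, the following sketch (ours, for illustration only) loads the wholesale customers' data with pandas and creates the 70/30 train/test split used later in Sect. 4. The file name and column names are assumed to follow the UCI distribution of the dataset.

```python
# Illustrative sketch (not the authors' code): load the Wholesale customers data
# and create the 70/30 split described in Sect. 4. File and column names are
# assumed to match the UCI "Wholesale customers" distribution.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Wholesale customers data.csv")  # assumed local copy of the UCI file

X = df.drop(columns=["Channel"])  # Region, Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen
y = df["Channel"]                 # 1 = HoReCa, 2 = Retail (coding assumed from the UCI documentation)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```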
For categorical data, the chi-square statistic is

\chi^2 = \sum \frac{(O - E)^2}{E}

where O denotes the observed frequencies and E the expected frequencies. For numerical data, the formula is

\chi^2 = \frac{(n - 1) s^2}{\sigma^2}

where n is the sample size, s^2 the sample variance, and \sigma^2 the population variance.
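In scikit-learn, a chi-square test between each (non-negative) feature and the class labels is available through the chi2 scorer. A minimal sketch (ours, not the authors' code), assuming the feature matrix X and target y from the previous snippet, is shown below; in the experiments of Sect. 4, the two highest-scoring features turn out to be Grocery and Detergents_Paper.

```python
# Illustrative sketch: chi-square feature scoring/selection with scikit-learn.
# Assumes the non-negative feature matrix X and target y built earlier
# (the chi2 scorer requires non-negative feature values).
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=2)   # keep the two best features
X_best = selector.fit_transform(X, y)

scores = dict(zip(X.columns, selector.scores_))
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # features ranked by chi-square score
```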
Fig. 2 Representation of the plane of a logistic regression
y(x) = w^T x \quad (2)

d = \frac{w^T x + b}{\lVert w \rVert} \quad (3)

Since the plane passes through (0, 0) and w is a unit vector (i.e., ||w|| = 1), Eq. (3) can be represented as

d = w^T x

If there are multiple points above and below the plane, the points above the plane are considered positive points, while the points below the plane are considered negative points.

Case 1 In the first case, a (positive) point (y = +1) lies above the plane, so w^T x > 0 and hence y \cdot w^T x > 0. This shows that the point is correctly classified.

Case 2 In the second case, the (negative) point is a triangle (y = -1) lying above the plane; thus, the distance of the point is positive. Hence, the cost function will be negative, as shown in the equations below.

Case 3 In the third case, a (negative) point (y = -1) lies below the plane, so w^T x < 0 and

y \cdot w^T x > 0

This shows that the point is correctly classified.

From the above three cases, it can be concluded that a correctly classified point maximizes the cost function:

\max_{w} \sum_i y_i \, w^T x_i

Then, the sigmoid function is

h(x) = \frac{1}{1 + e^{-z}}

The log-likelihood for the predicted class probabilities is given as
l(w^T) = \sum_{i}^{m} y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))

Then, the stochastic gradient descent update for (x, y) is calculated as

w^T = w^T + \alpha \, \nabla_{w^T} l(w^T) \quad (5)

When performing the partial differentiation of l(w^T), the gradient \nabla_{w^T} l(w^T) is given as

\nabla_{w^T} l(w^T) = (y - h_{w^T}(x)) \, x_i

Hence, putting the value of \nabla_{w^T} l(w^T) into Eq. (5), the updated weight is given as

w_i^T = w_i^T + \alpha (y - h_{w^T}(x)) \, x_i

where w is updated until the cost function is maximized. In the proposed work, a logistic regression classifier is imported from the linear models of the sklearn library.

3.3.2 Decision tree

In a decision tree, each leaf node denotes the class. The primary decision node is known as the root node. Each decision node is selected based on two popular methods (Hehn et al. 2019):

1 Information gain method: Let us first calculate the entropy of the target variable. The entropy is given as

E(s) = -\sum_i p_i \log_2 p_i \quad (6)

After expanding Eq. (6) for the binary class, it is given as

E(s) = -p_- \log_2 p_- - p_+ \log_2 p_+

Table 1 Example data for a decision tree

Ex.  Purchasing  Revisiting  Banking    Target
D1   High        Less        Increased  HoReCa
D2   High        Less        Medium     Retail
D3   Low         More        Medium     Retail
D4   High        Less        Medium     HoReCa
D5   Low         Less        Increased  Retail

Hence, in the example above, HoReCa (h) = 2, Retail (r) = 3, and the entropy of the target is

E(S) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971

The information gain of the Purchasing attribute is

Gain(S, Purchasing) = Entropy(S) - \frac{|S_{High}|}{|S|} \, Entropy(S_{High}) - \frac{|S_{Low}|}{|S|} \, Entropy(S_{Low})

Entropy(S_{High}) = 0.918

Entropy(S_{Low}) = 0
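These values are easy to verify. The short script below (ours, for illustration) recomputes the entropies and the resulting information gain of Purchasing for the example in Table 1.

```python
# Illustrative check of the Table 1 numbers: entropy of the target and the
# information gain of the Purchasing attribute (plain Python, no libraries).
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return max(0.0, -sum(p * log2(p) for p in probs))  # entropy is never negative

target = ["HoReCa", "Retail", "Retail", "HoReCa", "Retail"]  # D1..D5
purchasing = ["High", "High", "Low", "High", "Low"]

high = [t for t, p in zip(target, purchasing) if p == "High"]
low = [t for t, p in zip(target, purchasing) if p == "Low"]

gain = entropy(target) - len(high) / 5 * entropy(high) - len(low) / 5 * entropy(low)
print(round(entropy(high), 3), round(entropy(low), 3), round(gain, 3))
# -> 0.918 0.0 0.42, matching Entropy(S_High), Entropy(S_Low) and Gain(S, Purchasing)
```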
Since the Gain is high for Purchasing for both values of Banking, the tree can be drawn as shown in Fig. 4. The algorithm applied to draw this tree is the ID3 algorithm. Similarly, in the proposed work, the decision tree is drawn and evaluated using the sklearn library.

2 Gini index method: The Gini index is also used to calculate the gain using Eq. (7) and draw the tree. The Gini value at a node is calculated as

Gini_{node}(N) = 1 - \sum_{t \in \text{target values}} p(t)^2 \quad (7)

3.3.3 K-nearest neighbor

3 Minkowski distance:

D = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

In the above equation, p can be any real value, though it is typically set between 1 and 2. For values of p less than 1, the formula does not define a valid distance metric.

4 Hamming distance: The Hamming distance is used in the case of categorical values:

D_H = \sum_{i=1}^{n} |x_i - y_i|

where each coordinate contributes D = 0 if x = y and D = 1 if x \neq y.
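For reference, the two distance functions described here can be written directly; the sketch below (ours, for illustration) implements the Minkowski and Hamming distances for small tuples.

```python
# Illustrative sketch of the distance metrics used by KNN: Minkowski distance
# (p = 2 gives the Euclidean distance, p = 1 the Manhattan distance) and
# Hamming distance for categorical vectors.
def minkowski(x, y, p=2):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

print(minkowski((0, 0), (3, 4)))                     # 5.0 (Euclidean)
print(hamming(("High", "Less"), ("High", "More")))   # 1
```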
Table 2 Example data for KNN

Instances  F1  F2  Y
I1         6   4   1
I2         2   3   0
I3         4   5   1
I4         3   6   0
I5         5   8   1

For the query point, the Euclidean distances to the training instances are

I_2 = 4
I_3 = 2\sqrt{2} = 2.82
I_4 = \sqrt{2} = 1.41
I_5 = \sqrt{10} = 3.16

The three nearest points are I4, I3, and I5. The classes of these points (I3, I4, and I5) are 1, 0, and 1, so class 1 makes up the majority. Hence, the query point is predicted as class label 1. Similarly, the K-nearest neighbor classifier of the sklearn library is used to make predictions in the proposed work.
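A minimal sketch (ours, for illustration) of the same example with scikit-learn's KNeighborsClassifier is shown below; the query point is a hypothetical value chosen for the illustration, since it is not listed in Table 2.

```python
# Illustrative sketch: the Table 2 example with scikit-learn's KNeighborsClassifier.
# k = 3 matches the three nearest neighbours used above; the query point is hypothetical.
from sklearn.neighbors import KNeighborsClassifier

X = [[6, 4], [2, 3], [4, 5], [3, 6], [5, 8]]  # I1..I5 (F1, F2)
y = [1, 0, 1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[2, 7]]))  # class decided by majority vote of the 3 nearest neighbours
```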
3.3.4 Naïve Bayes

Table 3 Example data for Naïve Bayes

Purchasing  Revisiting  Banking    Client
High        Less        Increased  HoReCa
High        Less        Medium     Retail
Low         More        Medium     Retail
High        Less        Medium     HoReCa
Low         Less        Increased  Retail
Low         More        Increased  HoReCa
Low         More        Medium     Retail
High        Less        Medium     Retail
High        More        Increased  HoReCa
Low         Less        Increased  HoReCa

Table 6 Probability table for Banking

Banking    Client = HoReCa  Client = Retail
Increased  4/5              1/5
Medium     1/5              4/5

Naïve Bayes is a classification technique that uses the Bayesian theorem to predict the class of a new feature set (Sánchez-Franco et al. 2019). The Bayesian formula is given as

P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}

This implies that

P(Y \mid X) \propto P(X \mid Y) \, P(Y) = P(X_1 \ldots X_n \mid Y) \, P(Y)

In the above equation, the difficulty lies in calculating the joint probability. Hence, the Naïve Bayes classification is used. It assumes that all the features are conditionally independent (Kaviani and Dhotre 2017):
P(X_1 \ldots X_n \mid Y) = P(X_1 \mid Y) \, P(X_2 \mid Y) \cdots P(X_n \mid Y)

Hence, the probability for X_{new} = \langle X_1 \ldots X_n \rangle is calculated as (Eq. (9))

P(Y \mid X_1 \ldots X_n) = P(Y) \, \prod_i P(X_i \mid Y) \quad (9)

Now, the example presented in Table 3 is considered to explain the underlying mathematics better. Purchasing, Revisiting, and Banking are the features, and Client is the target. When the data is given to the model, it learns the probability of each feature, as shown in Tables 4, 5, and 6.

P(Client = HoReCa) = 5/10
P(Client = Retail) = 5/10

Now, a new tuple is given as

X_{new} = (Purchasing = High, Revisiting = More, Banking = Medium)

Now, predict the label for X_{new} (see the tables of the learning phase):

P(Purchasing = High \mid Client = HoReCa) = 3/5
P(Purchasing = High \mid Client = Retail) = 2/5
P(Revisiting = More \mid Client = HoReCa) = 2/5
P(Revisiting = More \mid Client = Retail) = 2/5
P(Banking = Medium \mid Client = HoReCa) = 1/5
P(Banking = Medium \mid Client = Retail) = 4/5
P(Client = HoReCa) = 5/10
P(Client = Retail) = 5/10

P(HoReCa \mid X_{new}) = \tfrac{3}{5} \cdot \tfrac{2}{5} \cdot \tfrac{1}{5} \cdot \tfrac{5}{10} = 0.024

P(Retail \mid X_{new}) = \tfrac{2}{5} \cdot \tfrac{2}{5} \cdot \tfrac{4}{5} \cdot \tfrac{5}{10} = 0.064

Since P(Retail \mid X_{new}) > P(HoReCa \mid X_{new}), the label of X_{new} is Retail. In the proposed work, the naïve Bayes classifier is imported from the sklearn library to learn and evaluate the model.
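The same worked example can be reproduced with scikit-learn's CategoricalNB; the sketch below (ours, for illustration) hand-encodes the categories of Table 3 as integers and uses a near-zero smoothing parameter so that the probabilities match the manual calculation above.

```python
# Illustrative sketch: the Table 3 example with scikit-learn's CategoricalNB.
# Encoding: Purchasing High=1/Low=0, Revisiting More=1/Less=0, Banking Increased=1/Medium=0.
# A near-zero alpha is used so that Laplace smoothing does not distort the hand-computed values.
from sklearn.naive_bayes import CategoricalNB

X = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1],
     [0, 1, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1]]
y = ["HoReCa", "Retail", "Retail", "HoReCa", "Retail",
     "HoReCa", "Retail", "Retail", "HoReCa", "HoReCa"]

clf = CategoricalNB(alpha=1e-10)
clf.fit(X, y)

x_new = [[1, 1, 0]]  # Purchasing=High, Revisiting=More, Banking=Medium
print(clf.predict(x_new))        # -> ['Retail']
print(clf.predict_proba(x_new))  # ~[0.27, 0.73], i.e. 0.024 vs 0.064 after normalisation
```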
3.3.5 Support vector machines

Support vector machines are one of the most effective algorithms for solving linear problems (Nalepa and Kawulok 2018). Although logistic regression also classifies linear problems, SVMs use the concept of support vectors to perform the linear separation. An SVM has a clever way to reduce overfitting and can use many features without requiring too much computation.

1 Mathematical intuition

Consider Fig. 5, which is a hypothetical example of a linearly separable dataset, with a decision boundary drawn as a solid line (a plane) and two dotted lines serving as the positive and negative planes. The green stars are considered positive points, while the orange circles are negative points. The equation of the positive plane is given as w^T x + b = +1, the equation of the negative plane is given as w^T x + b = -1, and the equation of the hypothesis plane is given as w^T x + b = 0.

The positive and negative planes change as the hypothesis plane changes. Here, this is explained by taking points that are linearly separable; however, in the real world, the data can be non-separable.
The nearest positive point lies on the positive plane and is denoted by d_+, while the nearest negative point lies on the negative plane and is given by d_-. Hence, the equation for the positive plane is given as w^T d_+ + b = +1 and that of the negative plane is given as w^T d_- + b = -1. Now, solving both equations gives

w^T (d_+ - d_-) = 2

After dividing both sides by ||w||, the following equation is derived:

\frac{w^T}{\lVert w \rVert} (d_+ - d_-) = \frac{2}{\lVert w \rVert}

Introducing Lagrange multipliers \alpha_i for the margin constraints and setting the derivative of the Lagrangian L_p with respect to b to zero gives

\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

\min_w L_p(w, b, \alpha_i) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j) - b \sum_{i=1}^{m} \alpha_i y_i

Since \sum_{i=1}^{m} \alpha_i y_i = 0, the last term vanishes and

\min_w L_p(w, b, \alpha_i) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^T x_j)
Fig. 6 Random forest classification

3.3.6 Random forest

3.3.7 Stochastic gradient descent

The advantages of stochastic gradient descent include its efficiency and ease of implementation (Bottou 2010). In this work, a linear support vector machine is used as a classifier, and a gradient descent algorithm is applied to optimize the results.
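scikit-learn exposes this combination through SGDClassifier: with the hinge loss it fits a linear SVM by stochastic gradient descent. A minimal sketch (ours, for illustration), assuming the train/test split created earlier, is:

```python
# Illustrative sketch: a linear SVM trained with stochastic gradient descent via
# scikit-learn's SGDClassifier (hinge loss). X_train/y_train/X_test/y_test are
# assumed from the earlier split; hyper-parameters are defaults, not the paper's.
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sgd_svm = make_pipeline(
    StandardScaler(),  # SGD is sensitive to feature scaling
    SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=42),
)
sgd_svm.fit(X_train, y_train)
print(sgd_svm.score(X_test, y_test))  # test-set accuracy
```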
After the linear combination of weights and inputs, the result goes through a nonlinear activation function, such as a sigmoid, ReLU, or softmax function. A perceptron is trained with a gradient descent algorithm.

A multilayer neural network is trained with the backpropagation algorithm. Let x_1, x_2, x_3, \ldots, x_n be the input values and y_1, y_2, y_3, \ldots, y_n be the output values. The predicted values corresponding to each input value are \hat{y}_1, \hat{y}_2, \hat{y}_3, \ldots, \hat{y}_n. Weight initialization is done with small random numbers. For one output neuron, the error is given as

E = \frac{1}{2} (y - \hat{y})^2 \quad (11)

For each node j, the output \hat{o}_j is defined as

\hat{o}_j = \varphi(net_j) = \varphi\!\left( \sum_{k=1}^{n} w_{kj} \, \hat{o}_k \right)

where net_j is the weighted sum of the outputs \hat{o}_k of the previous n neurons.

Finding the derivative of Eq. (11), that is, the error, with respect to a weight:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} \frac{\partial net_j}{\partial w_{ij}}

\frac{\partial E}{\partial w_{ij}} = \left( \sum_l \frac{\partial E}{\partial o_l} \frac{\partial o_l}{\partial net_l} w_{jl} \right) \varphi(net_j)\left(1 - \varphi(net_j)\right) o_i = \delta_j o_i

where

\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} = (o_j - y_j) \, o_j (1 - o_j)

if j is an output neuron, and

\delta_j = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial net_j} = \left( \sum_l \delta_l w_{jl} \right) o_j (1 - o_j)

if j is an inner neuron. To update the weight w_{ij} using gradient descent, one must choose a learning rate \eta:

\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}

Hence, weight updating is done as follows:

w_{ij} \leftarrow w_{ij} + \Delta w_{ij}, \qquad \Delta w_{ij} = \eta \, \delta_j \, x_{ij}

3.3.9 AdaBoost

AdaBoost is an ensemble technique and a boosting algorithm. It combines weak learners or classifiers to improve their performance (Schapire 2013). Each learner is trained with a simple set of training samples. Each sample has a weight, and the set of all weights is adjusted iteratively. AdaBoost iteratively trains each learner and calculates a weight for each one, and each weight represents the robustness of the corresponding weak learner. Here, a decision tree is used as the base learner (Freund and Schapire 1999). The AdaBoost algorithm has three main steps:

• Sampling: In this step, some samples D_t are selected from the training set, where D_t is the set of samples in iteration t.
• Training: In this step, different classifiers are trained using D_t, and the error rate \epsilon_i for each classifier is calculated.
• Combination: Here, all trained models are combined.

1 Mathematical intuition

Consider a dataset (Table 7) comprising five samples, where F1, F2, and F3 are features and Y is the output.

Table 7 Dataset for AdaBoost

F1   F2   F3   Y
X11  X12  X13  Y+
X21  X22  X23  Y-
X31  X32  X33  Y-
X41  X42  X43  Y+
X51  X52  X53  Y+

Step 1: Assign the same weight to each record. This is called the sample weight and is given by

w = \frac{1}{n}

where n is the number of records. Hence, each record is assigned a weight of 1/5.

In this algorithm, the tree should not be created as in a random forest. Instead, a decision tree is constructed up to one depth (called a stump) for each feature. From these stumps, the first base learner model is selected using the entropy or Gini index. The least-entropy stump is selected as the first base learner model.
Step 2: Suppose the first base learner model correctly classifies three samples and misclassifies two. The total error is then calculated by summing the weights of all the misclassified samples. Here, two samples are misclassified, each with a weight of 1/5. Then, the total error is

TE = \frac{2}{5} = 0.4

Step 3: Calculate the performance of the stump. The formula for this is

\text{Performance of stump} = \frac{1}{2} \log_e\!\left( \frac{1 - TE}{TE} \right) \quad (12)

Putting the values into Eq. (12) yields

\text{Performance of stump} = \frac{1}{2} \log_e\!\left( \frac{1 - 0.4}{0.4} \right) = 0.20

Step 4: Update the weights of the misclassified points to pass the samples to the next sequential base learner:

\text{New weight} = \text{old weight} \times e^{\text{Performance of stump}} \quad (13)

Putting the values into Eq. (13) yields

\text{New weight} = \frac{1}{5} \times e^{0.20} = 0.24

This updated weight is greater than 1/5. Decrease the weight of the correctly classified points as follows:

\text{New weight} = \text{old weight} \times e^{-\text{Performance of stump}} \quad (14)

Putting the values into Eq. (14) yields

\text{New weight} = \frac{1}{5} \times e^{-0.20} = 0.16

Hence, the updated weight is 0.16 for the three correctly classified points and 0.24 for the two misclassified points. The sample weights must sum to 1, so the updated weights have to be normalized. The next step is to sum all the updated weights as follows:

0.16 \times 3 + 0.24 \times 2 = 0.96

Suppose three decision trees are formed. In this case, the test data are passed, and the classifications obtained are Y+, Y-, and Y+. Then, using the majority voting method, the output is Y+.

3.3.10 XgBoost

XgBoost is an ensemble technique and a boosting algorithm. It uses the decision tree as a base model. The main advantages of XgBoost are that it is scalable and that it uses parallel and distributed computing to offer memory-efficient usage (Chen and Guestrin 2016). The steps in a gradient boosting algorithm are:

1 Take the first output \hat{y} as a constant based on a base classifier.
2 Compute the errors or residuals (R_1) using any loss function.
3 Construct the first decision tree with the features and the residual (R_1) as the target variable. After getting the predictions from this decision tree, consider (R_2) as the new residual, such that R_2 < R_1.
4 To compute the final result, combine both the base classifier and the decision trees sequentially. However, the direct combination of a decision tree with the base model will overfit the whole classifier. Hence, a learning rate is introduced in the combination as follows:

y(x) = O_{base}(x) + \alpha \, O_{DT_1}(x)

where \alpha is the learning rate.
5 If y(x) is significantly different from the actual output, then add another tree sequentially based on the input features and the residual R_2 as the target variable.

Suppose n decision trees are required. Then, the final output is given as

h(x) = h_0(x) + \alpha_1 h_1(x) + \alpha_2 h_2(x) + \ldots + \alpha_n h_n(x)

or

h(x) = h_0(x) + \sum_{i=1}^{n} \alpha_i h_i(x)
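Returning to the AdaBoost example, the arithmetic of Steps 2 to 4 can be checked with a few lines of Python (ours, for illustration); note that the text rounds the stump performance to 0.20 before reusing it, which is why it reports a sum of 0.96 rather than the unrounded 0.98.

```python
# Illustrative check of the AdaBoost weight updates in Steps 2-4 above (plain Python).
from math import exp, log

n = 5
w = 1 / n                                # initial sample weight (1/5)
total_error = 2 * w                      # two misclassified samples -> 0.4
performance = 0.5 * log((1 - total_error) / total_error)
w_wrong = w * exp(performance)           # increased weight of misclassified samples
w_right = w * exp(-performance)          # decreased weight of correctly classified samples
total = 3 * w_right + 2 * w_wrong        # sum before normalisation

print(round(performance, 2), round(w_wrong, 2), round(w_right, 2), round(total, 2))
# -> 0.2 0.24 0.16 0.98
```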
Table 8 Confusion matrix

3.3.11 Dummy classifier

A dummy classifier is used as a simple baseline; it makes predictions using one of the following strategies:

1 Stratified: generates random predictions that respect the training set class distribution.
2 Most frequent: always predicts the most frequent label in the training set.
3 Prior: always predicts the class that maximizes the class prior.
4 Uniform: generates predictions uniformly at random.
5 Constant: always predicts a constant label provided by the user.

In this paper, the stratified method is used.

3.3.12.2 RfAda

A random forest classifier and AdaBoost work as the base models, and KNN is used as the meta model. Due to the base models, the model is named RfAda (random forest and AdaBoost). The architecture of the model is given in Fig. 9.

3.3.12.3 KnnSgd

The base classifiers of this model are KNN and SGD, which is why it is named KnnSgd. The meta model is KNN. The architecture of KnnSgd is shown in Fig. 10.
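The stacked models described here map directly onto scikit-learn's StackingClassifier. A minimal sketch (ours, for illustration) of the KnnSgd architecture is given below; the hyper-parameters are assumptions, since the paper's exact settings are not reported in this section.

```python
# Illustrative sketch of the KnnSgd stacking architecture: KNN and SGD as base
# learners and KNN as the meta model, via scikit-learn's StackingClassifier.
# All hyper-parameters shown here are assumptions.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

knn_sgd = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("sgd", SGDClassifier(loss="hinge", max_iter=1000, random_state=42)),
    ],
    final_estimator=KNeighborsClassifier(n_neighbors=5),  # KNN as meta model
    cv=5,  # out-of-fold base-model predictions are used to train the meta model
)
# Usage: knn_sgd.fit(X_train, y_train); knn_sgd.score(X_test, y_test)
```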
Fig. 11 Visualization of data
4 Result and analysis

Each algorithm has been implemented using common steps. Feature selection is done using the chi-square test, and the two best features, Grocery and Detergents_Paper, were obtained. The visualization of the data across the two best features is shown in Fig. 11. The data is divided into two parts:

1 Training set (70%)
2 Test set (30%)

The analysis results for each algorithm are given below.

4.1 Logistic regression classifier

A logistic regression classifier classifies data based on the sigmoid function. Accuracy is calculated based on the selected features. When evaluating the accuracy with the two best features, the obtained result is 87.88%. The number of selected features is then increased one by one to improve the performance of the model. When selecting all the features, the highest accuracy of this model is 89.39%. The confusion matrix is shown in Fig. 12. Accuracy is calculated using Eq. (15):
\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \quad (15)

\text{Accuracy} = \frac{83 + 35}{83 + 11 + 3 + 35} = \frac{118}{132} = 0.8939
Hence, the accuracy obtained is 89.39%. The confusion
matrix can also be used to get the precision, recall, and f1
score. A low false positive value can be seen in the confu-
sion matrix of the logistic regression model.
The ROC curve for this model is shown in Fig. 13, and the area under the curve is 0.97. The ROC curve represents the relation between the false positive and true positive rates.
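The numbers reported in this subsection can be produced with a short scikit-learn script; the sketch below (ours, for illustration) assumes the train/test split created earlier, and the exact values depend on that split and on the solver settings.

```python
# Illustrative sketch: fit the logistic regression classifier and report the metrics
# discussed above (confusion matrix, accuracy, ROC AUC, cross-validated accuracy).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print(confusion_matrix(y_test, y_pred))                             # cf. Fig. 12
print(accuracy_score(y_test, y_pred))                               # cf. 0.8939 in the text
print(roc_auc_score(y_test, log_reg.predict_proba(X_test)[:, 1]))   # cf. ROC area 0.97
print(cross_val_score(log_reg, X, y, cv=5).mean())                  # cross-validated accuracy
```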
Fig. 22 ROC curve for SVM
Fig. 23 Confusion matrix for the random forest classifier
Fig. 24 ROC curve for the random forest classifier
Fig. 25 Confusion matrix for SGD
4.9 AdaBoost
4.12 Stacking
4.12.1 SvmAda
Using a machine learning technique is good for small datasets, while deep learning, with more parameters, should be used for large datasets. The supervised learning algorithms are limited to static datasets; if the dataset is dynamic, the focus should be on time series or deep learning models. Customer analysis is important for all companies. In the future, meeting the demand of targeted customers will require selling the products to the desired customers. Hence, the analysis of the customer is important.

The highest accuracy obtained is 92.42%. The variety of the data can be changed to gain more accuracy on the prediction of personality. Although the analysis is done on most of the important machine learning algorithms, a combination of new hybrid algorithms and a new dataset with more instances may improve a model's accuracy. Due to advancements in deep learning, customer purchasing behavior can be analyzed with video data. A more innovative solution can be given in smart malls to predict or suggest which items customers will purchase according to their needs.

In the future, explainable AI can be used to make the models more transparent and to understand the logic behind the segregation of the customers, which will make the models act as white boxes. Explaining the important features for each classification is a new challenge that needs to be resolved.

References

Adebola Orogun BO (2019) Predicting consumer behaviour in digital market: a machine learning approach. Int J Innov Res Sci Eng Technol 8(8):8391–8402
Adeniyi D, Wei Z, Yongquan Y (2016) Automated web usage data mining and recommendation system using k-nearest neighbor (KNN) classification method. Appl Comput Inform 12(1):90–108
Agatonovic-Kustrin S, Beresford R (2000) Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 22(5):717–727
Ali J, Khan R, Ahmad N, Maqsood I (2012) Random forests and decision trees. Int J Comput Sci 9(5). http://ijcsi.org/papers/IJCSI-9-5-3-272-278.pdf
Alloghani M, Al-Jumeily D, Baker T, Hussain A, Mustafina J, Aljaaf AJ (2018) Applications of machine learning techniques for software engineering learning and early prediction of students' performance. In: Communications in computer and information science. Springer, Singapore, pp 246–258
Amin A, Shah B, Khattak AM, Baker T, ur Rahman Durani H, Anwar S (2018) Just-in-time customer churn prediction: with and without data transformation. In: 2018 IEEE congress on evolutionary computation (CEC). IEEE
Bala R, Kumar D (2017) Classification using ANN: a review. Int J Comput Intell Res 13(7):1811–1820
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT'2010. Physica-Verlag HD, pp 177–186
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cardoso M (2014) UCI machine learning repository
Cardoso MGMS (2012) Logical discriminant models. In: Quantitative modelling in marketing and management. https://doi.org/10.1142/9789814407724_0008
Charanasomboon T, Viyanon W (2019) A comparative study of repeat buyer prediction. In: Proceedings of the 2019 2nd international conference on information science and systems. ACM
Chaubey G, Bisen D, Arjaria S, Yadav V (2020) Thyroid disease prediction using machine learning approaches. Natl Acad Sci Lett 44(3):233–238
Chen T, Guestrin C (2016) XGBoost. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. ACM
Das TK (2015) A customer classification prediction model based on machine learning techniques. In: 2015 International conference on applied and theoretical computing and communication technology (iCATccT). IEEE
Dawood EAE, Elfakhrany E, Maghraby FA (2019) Improve profiling bank customer's behavior using machine learning. IEEE Access 7:109320–109327
Do QH, Trang TV (2020) An approach based on machine learning techniques for forecasting Vietnamese consumers' purchase behaviour. Decis Sci Lett, pp 313–322. http://www.growingscience.com/dsl/Vol9/dsl_2020_16.pdf
Dreiseitl S, Ohno-Machado L (2002) Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 35(5–6):352–359
Džeroski S, Ženko B (2004) Is combining classifiers with stacking better than selecting the best one? Mach Learn 54(3):255–273
Freund Y, Schapire RE (1999) A short introduction to boosting. J Jpn Soc Artif Intell 14(5):771–780
Gupta G, Aggarwal H (2012) Improving customer relationship management using data mining. Int J Mach Learn Comput, pp 874–877. http://www.ijmlc.org/papers/256-L40070.pdf
Hehn TM, Kooij JFP, Hamprecht FA (2019) End-to-end learning of decision trees and forests. Int J Comput Vision 128(4):997–1011
Kachamas P, Akkaradamrongrat S, Sinthupinyo S, Chandrachai A (2019) Application of artificial intelligent in the prediction of consumer behavior from facebook posts analysis. Int J Mach Learn Comput 9(1):91–97
Kaviani P, Dhotre MS (2017) Short survey on naive Bayes algorithm. IJAERD
Kohavi R, Mason L, Parekh R, Zheng Z (2004) Lessons and challenges from mining retail e-commerce data. Mach Learn 57(1/2):83–113
Lavrač N, Cestnik B, Gamberger D, Flach P (2004) Decision support through subgroup discovery: three case studies and the lessons learned. Mach Learn 57(1/2):115–143
Liu W, Wang J, Sangaiah AK, Yin J (2018) Dynamic metric embedding model for point-of-interest prediction. Futur Gener Comput Syst 83:183–192
Momin S, Bohra T, Raut P (2019) Prediction of customer churn using machine learning. In: EAI international conference on big data innovation for sustainable cognitive computing. Springer International Publishing, pp 203–212
Nalepa J, Kawulok M (2018) Selecting training sets for support vector machines: a review. Artif Intell Rev 52(2):857–900
Raghuwanshi BS, Shukla S (2018) Class-specific extreme learning machine for handling binary class imbalance problem. Neural Netw 105:206–217
Rokach L, Maimon O (2005) Decision trees. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_9
Sánchez-Franco MJ, Navarro-García A, Rondán-Cataluña FJ (2019) A Naive Bayes strategy for classifying customer satisfaction: a study based on online reviews of hospitality services. J Bus Res 101:499–506
Sangaiah AK, Medhane DV, Han T, Hossain MS, Muhammad G (2019) Enforcing position-based confidentiality with machine learning paradigm through mobile edge computing in real-time industrial informatics. IEEE Trans Industr Inf 15(7):4189–4196
Santharam A, Krishnan SB (2018) Survey on customer churn prediction techniques. Int Res J Eng Tech 5(11):131–137
Schapire RE (2013) Explaining AdaBoost. In: Empirical inference. Springer, Berlin Heidelberg, pp 37–52
Sweilam NH, Tharwat A, Moniem NA (2010) Support vector machine for diagnosis cancer disease: a comparative study. Egypt Inform J 11(2):81–92
Ullah I, Raza B, Malik AK, Imran M, Islam SU, Kim SW (2019) A churn prediction model using random forest: analysis of machine learning techniques for churn prediction and factor identification in telecom sector. IEEE Access 7:60134–60149
Vafeiadis T, Diamantaras K, Sarigiannidis G, Chatzisavvas K (2015) A comparison of machine learning techniques for customer churn prediction. Simul Model Pract Theory 55:1–9
Zhang Z (2016) Introduction to machine learning: k-nearest neighbors. Ann Transl Med 4(11):218
Zhao B, Takasu A, Yahyapour R, Fu X (2019) Loyal consumers or one-time deal hunters: repeat buyer prediction for e-commerce. In: 2019 International conference on data mining workshops (ICDMW). IEEE