Customer Personality Analysis For Churn Prediction Using Hybrid Ensemble Models and Class Balancing Techniques
ABSTRACT Today's businesses rely heavily on focused marketing to improve their chances of growing and
keeping their consumer base. Internet behemoths like Google and Facebook have expanded their business
models around targeted advertisements that support business growth. Customer personality identification
helps companies predict churn. Churn arises in many companies, where customers leave for a variety of
reasons, and this gap motivates the present study of customer personality analysis. The collected
dataset was highly imbalanced. Two class-balancing approaches, CTGAN (Conditional Tabular
Generative Adversarial Network) and SMOTE (Synthetic Minority Oversampling Technique), were used
to equalize the two classes. Three ensemble approaches, bagging, boosting, and stacking, were used for
modeling: the bagging approach uses Random Forest (RF), while boosting uses XGBoost (XGB), Light
Gradient Boosting Machine (LGBM), and ADA Boost (ADA B). The proposed hybrid model, HSLR,
comprises RF, XGB, ADA Boost, and LGBM as base classifiers and Logistic Regression (LR) as a
meta-classifier. Three testing schemes were used: an independent test set and k-fold cross-validation
with 5 and 10 folds. Classifier performance was evaluated using accuracy, precision, recall, F1 score,
MCC, and ROC score. The SMOTE-generated data yielded better results than the CTGAN-generated data.
The SMOTE approach achieved the highest results of 94.06, 94.23, 94.28, 94.05, 88.13, and 0.984 for
accuracy, precision, recall, F1, MCC, and ROC score, respectively.
INDEX TERMS Customer personality analysis, machine learning, generative adversarial networks,
SMOTE.
2023 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
VOLUME 12, 2024 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ 1865
N. Ahmad et al.: Customer Personality Analysis for Churn Prediction
Technological advancements have ushered in a new era where automation streamlines the collection,
modeling, and evaluation of data, democratizing customer personality research for businesses of
varying scales [5]. This study delves deep into the multifaceted process of customer personality
analysis, a technique that scrutinizes the ideal customer profile through a rich dataset encompassing
variables such as age, educational background, marital status, parental status, income brackets, and
expenditure patterns across various products. By harnessing this data-driven approach, businesses can
foster strategies that resonate with the needs and aspirations of their clientele, facilitating
informed decision-making and fostering a competitive edge in the social media market [6].
Central to this study is the exploration of synthetic data generation to pinpoint customer personality
traits, a venture that encompasses a series of meticulous steps including data preparation and
cleansing. The research leverages class balancing techniques such as CTGAN and SMOTE, alongside
ensemble approaches including bagging, boosting, and stacking, to enhance the analytical depth.
This research situates itself in this critical juncture, aiming to bridge the gap between data
accumulation and strategic application through customer personality analysis. The problem of
efficiently and accurately analyzing customer personality is pressing, holding the key to unlocking a
more personalized, responsive, and successful business strategy. By delving into advanced analytical
techniques such as CTGAN and SMOTE for class balancing, and exploring ensemble approaches for data
analysis, this study seeks to offer a robust solution to a problem that stands at the heart of modern
business strategy.
The study aims to pave the way for businesses to foster deeper connections with their customers, drive
customer loyalty, and secure a sustainable competitive advantage in a fiercely competitive market. By
fostering a deeper understanding of customer personalities, this research equips businesses with the
knowledge to craft products and services that resonate with specific customer segments, augmenting
customer loyalty and expanding market share. It stands as a testament to the indispensable role of
customer personality analysis in steering businesses towards sustained growth and relevance in a
perpetually evolving market landscape.
The contributions of this study are manifold, including the utilization of a rich array of analytical
tools and testing methodologies:
I. Utilizing CTGAN and SMOTE techniques to balance the imbalanced dataset, enhancing the reliability
of the churn predictions.
II. Combining multiple base classifiers (RF, XGB, LGBM, and ADA B) to create a robust analytical
framework.
III. Introducing Logistic Regression as a meta-classifier to refine the predictive accuracy further.
IV. Implementing diverse testing paradigms including independent set testing and k-fold testing (with
5 and 10 folds) to ensure a robust evaluation of the model.
V. Employing a range of metrics such as accuracy score, precision, recall, F1 score, MCC, and ROC
score for a comprehensive evaluation of the model's performance.
VI. Conducting a comparative analysis of the data generated through CTGAN and SMOTE, offering insights
into the relative efficacy of these class balancing techniques.
The study unfolds in successive sections, each delineating crucial aspects of the research from a
comprehensive literature review to a detailed exposition of the dataset and the innovative approach
adopted, followed by an analytical discourse on the experimental results. The penultimate section
engages in a critical discussion on the achieved results juxtaposed against existing studies, paving
the way for the conclusion which delineates potential avenues for future research.

II. RELATED WORK
The literature on customer personality classification and churn prediction primarily focuses on
machine learning and deep learning techniques. This section categorizes studies based on their
methodologies and focus. It also highlights their limitations, setting the stage for this study's
proposed solution.
Many studies used machine learning algorithms like logistic regression, decision trees, and SVM to
predict customer churn. However, these studies lacked a detailed comparative analysis of algorithm
performance. They also focused on specific industries or regions, limiting their broader
applicability [7], [8].
One study analyzed unstructured call log data for churn prediction. It combined text mining and
machine learning techniques. Yet, it had a restricted dataset and lacked a detailed algorithm
comparison [9].
Another study used digital twins to identify personality traits. It employed CNN for feature
extraction and RNN for classification. But, its small dataset and limited trait focus could affect
its findings' generalizability [10].
A study on customer buying behaviors used a Multi-Layer Perceptron (MLP) neural network. It had a
small dataset and lacked a deep learning algorithm comparison [11].
Zhao et al. analyzed online product reviews' sentiment using Naive Bayes and SVM classifiers. The
study offered a new sentiment analysis perspective but lacked a machine learning algorithm
comparison [12]. Utami et al. categorized DISC personality using Bahasa Indonesian Twitter data. Its
single-language focus and limited dataset could restrict its broader applicability [13].
Another study predicted personality traits from social media content. It used topic modeling and SVM
classifiers but focused on limited personality traits [14].
A study used machine learning to identify client interaction decision-making styles. It employed a
decision tree approach but lacked a machine learning algorithm assessment [15].
One study discussed deep learning models for personality trait prediction. It lacked a detailed deep
learning algorithm
comparison [16]. The study on personality classification used clustering, decision tree, and SVM
algorithms. Its limited dataset could affect its findings' generalizability [17].
The study focuses on developing an effective customer churn prediction (CCP) model, named DFE-WUNB,
which operates in a cloud-computing environment. This model leverages deep feature extraction with
Artificial Neural Networks (ANN) to handle the complex, non-linear features of the Telco customer
churn dataset. The DFE-WUNB model demonstrates a higher accuracy in churn prediction compared to
conventional models [18].
The study underscores the integration of AI and ML in CRM tools, emphasizing the significance of churn
prediction in the banking sector. The research highlights the challenges of processing heterogeneous
data for optimal churn prediction [19]. The study introduces intelligent decision forest (DF) models
for churn prediction, focusing on the Logistic Model Tree (LMT), Random Forest (RF), and Functional
Trees (FT), including their enhanced versions based on weighted soft voting and stacking methods. The
proposed DF models effectively differentiate between churn and non-churn customers, even in
imbalanced scenarios, and have shown superior performance compared to existing ML-based methods. The
study suggests these DF models as optimal solutions for customer churn prediction in
telecommunications [20].
In contrast, this research proposes a hybrid ensemble model. It uses advanced class balancing
techniques to address previous works' limitations. The approach combines various machine learning and
deep learning techniques for customer personality analysis and churn prediction.
Previous studies on customer personality analysis relied on traditional statistical methods. These
methods often missed the multifaceted nature of consumer behavior. There was also a gap in using
diverse variables influencing consumer personality. Many studies had a narrow focus. Additionally,
they didn't use ensemble approaches, which offer a nuanced understanding. This study addresses these
gaps, introducing a comprehensive customer personality analysis approach.
Table 1 lists benchmark studies for customer personality classification. Many researchers also used
classification, regression, and clustering methods.

III. METHODOLOGY
The study employed CTGAN and SMOTE for data balancing. Four base classifiers were used: Random Forests
(RF), XGBoost (XGB), AdaBoost (ADA B), and LightGBM (LGBM). These classifiers served as base learners.
Additionally, Logistic Regression (LR) was utilized as a meta-classifier.
The purpose of the LR meta-classifier was to aggregate predictions from the base classifiers. This
aggregation was achieved using a stacking ensemble method. To address the unbalanced class
distribution in the training dataset, synthetic data was generated with the CTGAN technique.
Furthermore, the minority class underwent oversampling using SMOTE. Both CTGAN and SMOTE play crucial
roles in rectifying unbalanced class distribution, thereby improving classifier performance. Figure 1
provides a visual representation of the proposed architecture for customer personality
classification.
FIGURE 1. Proposed architecture for customer personality classification.

A. DATASET DESCRIPTION
The dataset collected represents the Consumer Personality Analysis, a technique used to identify a
company's ideal customers. It consists of 2240 samples, with 1906 samples belonging to the negative
class and 334 samples to the positive class [21]. TABLE 2 details the sample distribution before the
implementation of class-balancing approaches. The dataset includes customer information such as birth
year, education, marital status, whether they have children, income, and several other attributes.
The dataset was balanced using the SMOTE and CTGAN approaches, resulting in an equal number of samples
for both classes. TABLE 3 displays the sample distribution after the application of class-balancing
techniques.

B. DATASET PREPROCESSING
Data preprocessing in this study encompassed several crucial steps:
1. The dataset was imported via Google Colab.
This research is anchored in the seamless integration of powerful classifiers, complemented by
state-of-the-art class balancing techniques. This combination aims to provide a comprehensive
understanding of customer personalities. Subsequent sections provide an in-depth exploration of the
methodology, highlighting the synergy of the individual components in the proposed solution.
The foundation of this solution lies in the strategic selection of classifiers, renowned for their
consistent performance across diverse scenarios. The adoption of CTGAN and SMOTE as class balancing
techniques was driven by their demonstrated success in rectifying class imbalances, thus optimizing
classifier outcomes. Each classifier within the ensemble contributes its unique expertise,
collectively enhancing the overall predictive accuracy.
In the sections that follow, a detailed exposition of each solution component is presented, clarifying
the technical intricacies that drive their functionality. From the nuances of synthetic data
generation using CTGAN and SMOTE to the intricate operations of each classifier in the ensemble, a
comprehensive overview of the methodological framework is provided.
1) CTGAN
CTGAN is a Generative Adversarial Network (GAN)-based approach used to produce synthetic data to
balance imbalanced classes in a dataset. The generator transfers the original data into a latent space
and produces synthetic samples from it using an encoder-decoder architecture [22]. Figure 2 shows the
CTGAN architecture used to conduct this study. In order to generate synthetic samples that are closer
to the actual data, the generator is trained to minimize the loss function:
$$L_G = \log\bigl(1 - D(G(Z))\bigr) \tag{1}$$
To differentiate between the simulated and real data, the discriminator is trained to minimize the
loss function:
$$L_D = -\log\bigl(D(X)\bigr) - \log\bigl(1 - D(G(Z))\bigr) \tag{2}$$
The objective of CTGAN is to provide artificial data that can balance unbalanced classes in a dataset.
Machine learning algorithms that depend on balanced data can perform better by employing CTGAN. In
data science and machine learning, the CTGAN technique is frequently used to overcome issues with
class imbalance. It has been demonstrated that the method is efficient at producing synthetic data
that closely resembles the distribution of real data. Many applications, such as fraud detection,
medical diagnosis, and credit scoring, can make use of CTGAN. Overall, CTGAN provides a powerful tool
for addressing the challenge of imbalanced data in machine learning. TABLE 4 shows the
hyper-parameters and respective values for the CTGAN model.
TABLE 4. Setting for the CTGAN.
2) SMOTE
A common approach for producing synthetic data to balance unbalanced classes in a dataset is SMOTE
(Synthetic Minority Over-Sampling Technique). The algorithm creates new minority class instances based
on the minority class instances that already exist. SMOTE uses the k-nearest neighbors algorithm to
find the minority class examples that are close to one another in the feature space. The algorithm
then generates new synthetic examples by interpolating between instances of the minority class and
their k-nearest neighbors [23]. Figure 3 shows the SMOTE architecture for customer personality
classification. Interpolation is carried out using the following equation:
$$\text{New Sample} = MI + \bigl(RM \times (NN - MI)\bigr) \tag{3}$$
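The interpolation in Eq. (3) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: MI is read as a minority-class instance, NN as one of its k nearest minority neighbors, and RM as a random number drawn uniformly from [0, 1). The helper names `nearest_neighbors` and `smote_sample` and the toy data are hypothetical.

```python
import math
import random

def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]

def smote_sample(minority, k, rng):
    """Create one synthetic sample as MI + RM * (NN - MI)."""
    mi = rng.choice(minority)                          # MI: a minority instance
    others = [p for p in minority if p is not mi]
    nn = rng.choice(nearest_neighbors(mi, others, k))  # NN: one of its k neighbors
    rm = rng.random()                                  # RM: assumed uniform in [0, 1)
    return [m + rm * (n - m) for m, n in zip(mi, nn)]

rng = random.Random(0)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]  # toy minority class
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(3)]
```

Because each synthetic point lies on the segment between two existing minority points, it always stays inside the region already occupied by the minority class.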
FIGURE 3. SMOTE approach for synthetic dataset.
1) RANDOM FOREST
Random Forest is an ensemble learning method that combines multiple decision trees to improve the
overall performance of the model. As a bagging method, it draws subsets of the data and trains a
decision tree on each subset [24]. The final prediction for a given test sample is the class chosen by
the majority of the decision trees in the forest. Each decision tree in a Random Forest is trained on
a random selection of data points drawn with replacement, a technique called bootstrap aggregating,
also referred to as bagging. Also, for each split in the decision tree, a random subset of features is
considered rather than all features. This improves the model's generalizability and reduces
overfitting.
Figure 4 shows the bagging approach followed in this research. The Random Forest classifier was used
in the bagging approach.
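The three ingredients of bagging described for Random Forest (bootstrap sampling with replacement, a random feature subset per split, and majority voting) can be sketched as follows. This is an illustrative fragment only, with hypothetical helper names and toy inputs; it does not train real decision trees and is not the configuration used in the study.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, size, rng):
    """Pick the feature indices that a single split is allowed to consider."""
    return sorted(rng.sample(range(n_features), size))

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))                  # stand-in for 10 training samples
sample = bootstrap_sample(rows, rng)    # same size as rows, duplicates allowed
features = random_feature_subset(n_features=8, size=3, rng=rng)
label = majority_vote([1, 0, 1, 1, 0])  # votes from five hypothetical trees
```

Sampling with replacement means some rows appear several times in a tree's training set while others are left out, which is what makes the individual trees differ from one another.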
challenging data. The following equation updates the weight of an observation, indicated by $w_i$:
$$w_i = 0.5 \times \ln\left(\frac{1 - \mathrm{error}_i}{\mathrm{error}_i}\right) \tag{4}$$
where $\mathrm{error}_i$ is the misclassification rate of the base classifier at iteration $i$.
The base classifier is trained again using the new weights after each observation's weight has been
updated. The predictions of all the base classifiers are combined to produce the final prediction,
with the accuracy of each classifier determining the weight of its contribution. Formally, the final
prediction is:
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i h_i(x)\right) \tag{5}$$
where $h_i(x)$ is the ith classifier's prediction and $\alpha_i$ is the ith classifier's weight.
AdaBoost is a potent ensemble method that is widely used in a variety of fields, including computer
vision, natural language processing, and bioinformatics. It can boost the performance of a weak
classifier by lowering its bias and variance. Additionally, it is computationally efficient and simple
to implement. Since AdaBoost is sensitive to noisy data and outliers, pre-processing the data is
essential before using it.
3) LGBM
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that makes use of
tree-based learning techniques. It is designed to be efficient and scalable, making it suitable for
large datasets and high-dimensional feature spaces [25].
To build a tree-based model, LightGBM repeatedly divides the feature space into smaller subspaces and
trains a decision tree on each subspace. The partitioning is carried out by determining the optimal
split point for each feature in terms of a loss function. LightGBM finds the best split point using a
gradient-based optimization algorithm, which is quicker than more conventional approaches like
exhaustive search or approximate algorithms [26]. The gradient-based optimization algorithm employed
by LightGBM is "Gradient-based One-Side Sampling" (GOSS), a variation of the conventional gradient
descent technique. For each split, GOSS chooses a random subset of data points using a method known as
"one-sided sampling," which lowers the computational cost of the optimization procedure. The
predictions of all the decision trees in the ensemble are added together to get the final prediction.
Formally, the final prediction is:
$$f(x) = \sum_{i=1}^{n} f_i(x) \tag{6}$$
where $f_i(x)$ is the prediction of the ith decision tree and $n$ is the total number of decision
trees. LightGBM is frequently used for large-scale machine learning tasks because of its short
training time and good predictive accuracy. LightGBM offers a number of options for parallel and
distributed computing that can further reduce training time, and it handles high-dimensional data and
categorical features with ease.
4) XGBOOST
XGBoost (eXtreme Gradient Boosting) is an open-source gradient boosting system with a focus on
efficiency and scalability [27]. Like other gradient boosting methods, XGBoost trains an ensemble of
decision trees by repeatedly partitioning the feature space and training a decision tree on the
partitioned subspace. The basic idea behind XGBoost is to optimize the objective function by adding
new trees to the ensemble. The objective function is a measure of the model's performance, and it can
differ between classification and regression problems. For classification problems, the objective
function is usually the log-loss function, which is defined as:
$$L(y, f(x)) = -\frac{1}{n} \sum_{i=1}^{n} \Bigl[ y_i \log\bigl(f(x_i)\bigr) + (1 - y_i) \log\bigl(1 - f(x_i)\bigr) \Bigr] \tag{7}$$
Here, $n$ is the number of observations, $y_i$ denotes the true label, and $f(x_i)$ denotes the
predicted probability. In regression problems, the mean squared error typically serves as the loss
function:
$$L(y, f(x)) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 \tag{8}$$
XGBoost uses a gradient-based optimization algorithm, as opposed to more conventional techniques like
exhaustive search or approximate algorithms, to find the best split point quickly. Additionally, it
employs regularization to lessen overfitting and enhance the model's generalizability. The objective
function includes the regularization term, which is defined as follows:
$$L(y, f(x)) + \gamma T + \lambda \sum_{i} w_i^2 \tag{9}$$
where $w_i$ is the weight of the ith feature, $\gamma$ is the complexity parameter, and $\lambda$ is
the L2 regularization term. XGBoost is regarded as one of the most powerful and widely used machine
learning algorithms and is noted for its quick training time and high predictive accuracy. It can
easily handle categorical features and high-dimensional data, and it offers a variety of parallel and
distributed processing options that can reduce training time even further.
5) LOGISTIC REGRESSION
Logistic regression is a supervised learning algorithm used for classification problems. Its
fundamental goal is to model the probability of a binary outcome (such as success or failure, 1 or 0)
given a set of input data. The model's output is a probability between 0 and 1 that can be interpreted
as the likelihood of the positive class. Using the logistic function, also called the sigmoid
function, logistic regression mathematically models the likelihood of the positive class:
$$p(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T} x + b)}} \tag{10}$$
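As a small worked illustration of the sigmoid model in Eq. (10), together with the binary log-loss of Eq. (7) commonly used to fit such a model, the following plain-Python sketch evaluates both. The coefficient values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """p(y = 1 | x) = 1 / (1 + exp(-(w^T x + b)))."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def log_loss(y_true, p_pred):
    """Mean binary log-loss over n observations."""
    n = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred))
    return -total / n

w, b = [0.8, -0.5], 0.1                  # hypothetical coefficients
p = predict_proba(w, b, [1.0, 2.0])      # here w^T x + b = -0.1, so p is just below 0.5
loss = log_loss([1, 0], [0.5, 0.5])      # an uninformative model scores ln(2)
```

A prediction of exactly 0.5 for every sample yields a log-loss of ln 2, which serves as a convenient baseline when reading loss values.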
FIGURE 5. Bagging approach used for the proposed system. The Random Forest classifier is used in the
bagging approach.
6) HYBRID STACKING BASED LOGISTIC REGRESSION (HSLR)
The four base learners (ADA Boost, XGBoost, Random Forests, and LightGBM) are used in a stacking
ensemble classifier with a logistic regression meta-learner. The base learners are trained on the
input data, and their predictions are used as features for the meta-learner. A final prediction is
then made by the meta-learner by combining the predictions of the base learners. The stacking ensemble
method is effective because it can combine the benefits of several models while minimizing their
drawbacks. In the stacking ensemble method, the LR meta-classifier is trained to blend the predictions
from the base classifiers, taken as input features, into a final prediction. Figure 6 shows the HSLR
architecture (RF, LGBM, XGB, ADA), (LR). Four classifiers such as RF, LGBM, XGB, and ADA boost have
Input:
- Training dataset: D_train = {(x1, y1), (x2, y2), ..., (xn, yn)}
- Testing dataset: D_test = {x1', x2', ..., xm'}
- Base learners: BL = {Random Forest (RF), XGBoost (XGB), LightGBM (LGBM), ADA Boost (ADA)}
- Meta learner: Logistic Regression (LR)
Procedure:
1. For each base learner bl in BL do
   1.1 Train bl on D_train to get the trained model M_bl
   1.2 Use M_bl to predict the labels of D_train, store the predictions as P_bl_train
   1.3 Use M_bl to predict the labels of D_test, store the predictions as P_bl_test
2. Combine the predictions P_bl_train from all base learners to form a new feature matrix F_train for D_train
3. Combine the predictions P_bl_test from all base learners to form a new feature matrix F_test for D_test
4. Train the meta learner LR on F_train using the true labels from D_train to get the trained meta model M_LR
5. Use M_LR to predict the labels of F_test, store the predictions as P
6. Return P
Output:
- Predictions on the testing dataset: P = {y1', y2', ..., ym'}
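The stacking procedure (steps 1 to 6) can be traced with a small, self-contained Python sketch. It is only a schematic of the data flow, under clearly stated assumptions: the base learners are toy threshold rules standing in for RF, XGB, LGBM, and ADA, the meta learner is a tiny logistic regression fitted by stochastic gradient descent, and all function names and data values are hypothetical rather than taken from the study's code.

```python
import math

def make_threshold_trainer(feature):
    """Toy base learner (stand-in for RF, XGB, LGBM, or ADA): predicts 1 when
    one feature exceeds its training mean. It ignores the labels y."""
    def train(X, y):
        cut = sum(row[feature] for row in X) / len(X)
        return lambda rows: [1 if row[feature] > cut else 0 for row in rows]
    return train

def train_logistic_regression(F, y, lr=0.5, epochs=200):
    """Tiny logistic-regression meta learner fitted by stochastic gradient descent."""
    w, b = [0.0] * len(F[0]), 0.0
    for _ in range(epochs):
        for row, target in zip(F, y):
            z = sum(wi * fi for wi, fi in zip(w, row)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - target
            w = [wi - lr * err * fi for wi, fi in zip(w, row)]
            b -= lr * err
    return lambda rows: [1 if sum(wi * fi for wi, fi in zip(w, r)) + b > 0 else 0
                         for r in rows]

def stack_predict(base_trainers, meta_trainer, X_train, y_train, X_test):
    models = [train(X_train, y_train) for train in base_trainers]  # step 1.1
    P_train = [m(X_train) for m in models]                         # step 1.2
    P_test = [m(X_test) for m in models]                           # step 1.3
    F_train = [list(col) for col in zip(*P_train)]                 # step 2
    F_test = [list(col) for col in zip(*P_test)]                   # step 3
    meta = meta_trainer(F_train, y_train)                          # step 4
    return meta(F_test)                                            # steps 5 and 6

X_train = [[1.0, 5.0], [2.0, 4.0], [6.0, 1.0], [7.0, 2.0]]  # toy data
y_train = [0, 0, 1, 1]
X_test = [[1.5, 4.5], [6.5, 1.5]]
P = stack_predict([make_threshold_trainer(0), make_threshold_trainer(1)],
                  train_logistic_regression, X_train, y_train, X_test)
```

In this toy setup the second base learner is anti-correlated with the label, and the meta learner compensates by assigning it a negative weight, which illustrates how a meta-classifier can make use of base predictions that would be misleading on their own.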
then contrasted with the test set's actual labels. The performance of the model is measured using
common assessment measures including accuracy, precision, recall, F1-score, and AUC-ROC. TABLE 6
shows the independent set testing results for CTGAN-generated data.
FIGURE 10. ROC from SMOTE generated data using independent set testing.
Five-Fold cross validation.
a: FOR CTGAN
Figure 11 illustrates the confusion matrix (CM) for the highest accuracy obtained. RF has the highest
MCC score, and the CM was drawn from the RF evaluation. The CM was obtained using cross_val_predict
from the sklearn library, which yields a predicted label for each sample under 5-fold CV. Figure 12
shows the ROC obtained from 5-fold cross-validation for the CTGAN architecture. ADA Boost outperformed
the other approaches with a score of 0.95.
FIGURE 13. CM for SMOTE generated data using 5 fold CV.
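Since the confusion matrix and the MCC score are used together throughout the results, a short sketch of how MCC follows from the four confusion-matrix counts may be helpful. The labels below are toy values, not results from the study. MCC ranges from -1 to 1; the scores quoted in this paper (for example 88.13) appear to be expressed on a 0 to 100 scale, which is an interpretation rather than something stated in the text.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # toy labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # toy predictions
counts = confusion_counts(y_true, y_pred)
score = mcc(*counts)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it a useful companion metric when the original class distribution is imbalanced.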
of our experiments. 10-fold cross-validation splits a dataset into 10 equal "folds." The data is
divided into 10 parts, with 9 parts used for training and 1 part for testing. Each of the 10 parts is
used as the test set exactly once over the procedure's ten repetitions. To provide a final assessment
of model performance, the results of each test are then combined. Ten-fold cross-validation, which
provides a more reliable assessment of model performance than a single train/test split, is a
commonly used technique for assessing machine learning models. TABLE 10 shows the results attained
from 10-fold cross-validation using CTGAN, where the boosting approach LGBM outperformed the other
approaches.
TABLE 10. Results by CTGAN Generated Data for 10-Fold CV.
TABLE 11 shows the results obtained from 10-fold cross-validation using the SMOTE architecture.
TABLE 11. Results for SMOTE generated data using 10-Fold CV.
The bagging approach RF showed the best results, with an MCC score of 87.57.
Figure 17 shows the ROC attained from 10-fold cross-validation for the SMOTE architecture, where RF
outperformed the remaining approaches with a score of 0.981.
FIGURE 17. ROC for SMOTE generated data using 10-Fold CV.
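The fold construction behind k-fold cross-validation can be sketched in plain Python as follows. This is a minimal, unshuffled split written for illustration; the function name `k_fold_indices` and the sample count are hypothetical, and practical implementations usually shuffle or stratify the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    base, extra = divmod(n_samples, k)
    sizes = [base + 1] * extra + [base] * (k - extra)  # fold sizes differ by at most 1
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))  # toy sample count
```

Each of the ten iterations trains on nine folds and tests on the remaining one, so every sample contributes to the evaluation exactly once.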
FIGURE 16. ROC for 10 fold using CTGAN generated data.

V. DISCUSSION
In this section, a comparative analysis is presented between the proposed solution and existing
state-of-the-art studies. The emphasis is on highlighting the enhanced performance of the proposed
methodology across various metrics.
TABLE 12. Comparison with state-of-the art studies.
Table 12 offers a detailed comparison with recent studies, showcasing the effectiveness of the
proposed predictor, which surpasses other methodologies. Notably, the table highlights the improved
performance metrics achieved using data generated by SMOTE compared to data generated by GAN. The
initial dataset had a significant class imbalance, with an unequal distribution of positive and
negative classes. Such an imbalance can lead machine learning or deep learning models to be inherently
biased towards the majority class, often skewing predictive accuracy.
To address this, two cutting-edge class-balancing techniques were employed: SMOTE and GAN, with a
specific focus on the CTGAN architecture for synthetic data generation. This strategy not only
balanced the dataset, improving the reliability of the predictive model, but also provided a deeper
insight into the data patterns.
The analysis indicates that SMOTE outperforms CTGAN in addressing the dataset imbalance. While CTGAN
is adept at generating synthetic data, it sometimes struggles to capture complex patterns and
relationships present in real data, particularly in cases of significant data imbalance and
high-dimensional datasets. In contrast, SMOTE creates data points that mirror existing entries,
providing a truer representation of the actual data. Its straightforward application across diverse
datasets makes it the preferred choice for this study.
The proposed hybrid model, HSLR, incorporates machine-learning classifiers such as RF, XGB, ADA Boost,
and LGBM as base learners, with LR acting as a meta-classifier. This combination of algorithms
capitalizes on the strengths of each classifier, resulting in a model with superior predictive
accuracy and dependability. In summary, this study presents a significant advancement in addressing
imbalanced datasets, demonstrating a predictive model that excels in comparison to existing
state-of-the-art studies. The achieved performance metrics validate the effectiveness of the approach,
suggesting promising avenues for future research in this area.
The findings of this study, while promising, open up several avenues for future research and
exploration in the realm of customer personality analysis and churn prediction. One potential area of
exploration is the integration of deep learning algorithms, such as Convolutional Neural Networks
(CNNs) and Recurrent Neural Networks (RNNs). These algorithms can be particularly effective in
handling unstructured data and might offer enhanced predictive accuracy. As the field of customer
personality analysis evolves, there's scope to explore additional features and attributes that might
influence churn prediction. Advanced feature engineering techniques can be employed to extract more
meaningful insights from the data.
With the exponential growth of data in today's digital age, ensuring that the proposed methodologies
are scalable becomes paramount. Future research can focus on optimizing the current architecture to
handle vast datasets efficiently, possibly integrating distributed computing frameworks like Apache
Spark. One of the potential areas of exploration is the development of real-time churn prediction
systems, providing businesses with immediate insights and allowing them to take proactive measures to
retain customers. The current study, while focused on a specific industry or domain, leaves room for
exploration of the applicability of the proposed methodologies across different industries,
understanding the nuances and challenges unique to each.
As synthetic data generation techniques become more prevalent, addressing ethical considerations
related to data privacy and usage will be crucial. Future research can delve into developing
frameworks that ensure the ethical generation and use of synthetic data. The integration of
traditional statistical methods with machine learning and deep learning algorithms can lead to the
development of hybrid models, offering a more holistic view of customer behavior. Based on the
insights derived from customer personality analysis, future research can also focus on devising
personalized marketing strategies tailored to individual customer preferences, enhancing engagement
and loyalty. Incorporating a feedback loop mechanism can ensure that the models are continuously
updated based on real-world performance, leading to more adaptive and resilient prediction systems. In
conclusion, the field is ripe for further exploration, with research in this domain playing a pivotal
role in shaping customer-centric strategies and ensuring sustained growth.

VI. CONCLUSION
This study tackled the challenge of class imbalance in machine learning models by employing CTGAN and
SMOTE to generate synthetic data. The results indicated SMOTE's superiority over CTGAN in terms of
various performance metrics. The dataset used exhibited class imbalances, which can bias machine
learning models towards the majority class. This issue was addressed by generating synthetic data
using SMOTE and CTGAN. The proposed HSLR model utilized various classifiers, and its performance was
evaluated using metrics such as accuracy score, precision, recall, F1 score, MCC, and ROC score. The
SMOTE approach yielded the highest results, outperforming existing methods. Future plans include
collecting more data and exploring deep neural architectures like FCN, CNN, LSTM, and GRU. This
study's findings offer insights into the application of machine learning and deep learning in
addressing class imbalance issues, with potential applications in domains like
healthcare, finance, and security. The code is available on the GitHub repository:
https://github.com/mazhar786/Customer-personality-.
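As a rough illustration of the SMOTE balancing step summarized above, the following sketch generates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, standard-library-only approximation for numeric features, not the authors' implementation; the function name `smote_oversample`, its parameters, and the toy data are assumptions introduced here for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature lists (minority class only).
    k: number of nearest minority neighbours considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical values, not taken from the paper's dataset).
toy_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
toy_majority_size = 10
toy_new = smote_oversample(toy_minority, toy_majority_size - len(toy_minority), k=2)
balanced_minority = toy_minority + toy_new  # now as large as the majority class
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. In practice a library implementation, such as the `SMOTE` class in imbalanced-learn, would normally be preferred; this section does not state which implementation the authors used.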
Table 13 lists each abbreviation used in this paper.

TABLE 13. Notational table.

NOMAN AHMAD received the master’s degree from the esteemed University of Management and
Technology. He is currently a distinguished author in the realm of computer science. Renowned as
a Versatile Solution Provider, he has garnered recognition for his exceptional technical prowess
and adept problem-solving skills, consistently aiding a myriad of clients worldwide. With an
entrepreneurial spirit that knows no bounds, he is unrelenting in his pursuit of knowledge, striving
to elevate his expertise within the realms of artificial intelligence, machine learning, and data
sciences.
MAZHAR JAVED AWAN received the Master of Science degree in computer science from the
University of Central Punjab (UCP), Lahore, the master’s degree in computer science from
COMSATS Lahore, and the Ph.D. degree, on medical image detection, from Universiti Teknologi
Malaysia (UTM). He is currently an Assistant Professor with the Software Engineering Department,
University of Management and Technology (UMT), Lahore, and brings 20 years of diverse
experience from a multitude of academic institutions. In addition to his academic pursuits, he is
a highly sought-after trainer, a consultant, and a curriculum member in the field of artificial
intelligence, serving a range of academic, government, and corporate sectors. As a prolific
researcher, he has a publication record of over 55 research papers in high-impact-factor journals
and top conferences in the fields of artificial intelligence, data sciences, big data, deep learning,
natural language processing, and machine learning. He has a Google Scholar H-index of 28, with
2000 citations. In October 2022, he was recognized by Stanford University as one of the top 2%
globally influential scientists. He has been a sought-after keynote speaker, invited to numerous
prestigious institutes in Pakistan and Malaysia, and he has served as a judge in AI competitions.
He is an active member of the IEEE Lahore Section Pakistan.

AZLAN MOHD ZAIN (Member, IEEE) received the Ph.D. degree in computer science from
Universiti Teknologi Malaysia (UTM), in 2010. He is currently a Professor with the Faculty of
Computing, UTM. As a member of the academic staff, he has successfully supervised more than
25 postgraduate students and received more than 20 research grants to support research students.
He has published more than 100 research papers. He has been invited as a keynote speaker at over
five international conferences, serves on numerous committees, and has served on the editorial
boards of several international journals.

ANSAR NASEEM is currently pursuing a graduate degree with the University of Management and
Technology (UMT). His research interests include machine learning, deep learning with a focus on
natural language processing, and bioinformatics. He has several publications to his name.