JCP 05 00009
Computer, Networks, Modeling, and Mobility Laboratory (IR2M), Faculty of Sciences and Techniques,
Hassan First University of Settat, Settat 26000, Morocco; [email protected]
* Correspondence: [email protected]
Keywords: fraud detection; generative models; variational autoencoder; generative adversarial neural network; class imbalance; autoencoder; deep learning
Academic Editor: Danda B. Rawat
Received: 13 February 2025
Revised: 14 March 2025
Accepted: 14 March 2025
Published: 17 March 2025
Financial institutions are tasked with the critical challenge of quickly and accurately
identifying and isolating fraudulent transactions while maintaining a smooth customer
experience. “Quickly” emphasizes the need for a detection model that minimizes delays,
protecting both customers and institutions from potential issues. Meanwhile, “accurately”
highlights the importance of precise fraud detection, as false positives can lead to unneces-
sary resource allocation [3]. Traditionally, fraud detection methods, such as manual review
or rule-based models, have shown limited effectiveness. Manual detection is slow, requir-
ing a long time to conclude, while rule-based approaches involve complex rules that must
be applied and assessed before a transaction can be labeled as suspicious [4]. Both meth-
ods demand significant effort to establish criteria for identifying fraudulent transactions
and struggle to detect new, unknown, and sophisticated fraud patterns. For this reason,
financial institutions invest heavily in artificial intelligence techniques that can prevent
fraudulent transactions with higher accuracy. AI-driven
fraud detection systems offer gains in speed, efficiency, and adaptability [5]. Machine
learning (ML) models, such as Logistic Regression, Decision Tree, Random Forest, Gradient
Boosting, XGBoost, LightGBM, K-Nearest Neighbors, Naive Bayes, AdaBoost, and Bagging
Classifier, offer a range of solutions for detecting fraudulent transactions [6]. These models
are effective at handling various data patterns and can be used individually or combined to
enhance fraud detection capabilities [7]. Likewise, deep learning (DL) techniques, including
Long Short-Term Memory networks, Artificial Neural Networks, and Recurrent Neural
Networks, further strengthen these solutions by analyzing large datasets and uncovering
intricate patterns [8]. The integration of these ML and DL models into hybrid algorithms
provides a comprehensive approach to fraud detection. Hybrid models leverage diverse
techniques to improve detection accuracy, reduce false positives, and adapt to evolving
fraud tactics [9]. They also optimize computational resources for scalability, ensuring
efficient performance even with large-scale data. As financial fraud becomes increasingly
sophisticated, the application of advanced ML and DL models helps institutions stay ahead
of threats, manage risks effectively, comply with regulations, and protect against financial
losses [10].
Class imbalance is a critical challenge because most machine learning algorithms
optimize overall accuracy by default. This can cause models to ignore the minority class, prioritizing
correct predictions for the majority while failing to detect rare but critical instances. For
example, in fraud detection, fraudulent transactions may represent less than 1% of all
data, allowing a model to achieve 99% accuracy by naively labeling every transaction as
J. Cybersecur. Priv. 2025, 5, 9 3 of 36
legitimate [11]. However, accurately identifying rare fraudulent transactions is critical for
financial institutions to prevent losses. The imbalance makes it difficult for classifiers to
effectively learn from the limited examples of fraudulent transactions [12]. Traditional
methods designed for balanced datasets often focus on overall accuracy, which can result
in poor performance in detecting the minority class. To tackle this issue, several techniques
have been developed [13]. Data-level methods, such as oversampling and undersam-
pling, aim to adjust the dataset to mitigate the imbalance. Oversampling increases the
number of fraudulent transaction samples by duplicating them, while undersampling
reduces the number of legitimate transactions, potentially at the cost of losing valuable
information [4]. Advanced methods like the Synthetic Minority Oversampling Technique
(SMOTE) generate synthetic examples of fraudulent transactions [14]. Its variants, such as
the Adaptive Synthetic Sampling Approach (ADASYN) [15], Borderline-SMOTE, Majority
Weighted Minority Oversampling Technique (MWMOTE), and Weighted Kernel-Based
SMOTE, also generate synthetic samples [16], helping to balance
the dataset with less risk of overfitting than simple duplication. These approaches are essential for improving
detection rates and effectively managing class imbalance in fraud detection systems.
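To make the difference between duplication and interpolation concrete, the following minimal Python sketch contrasts random oversampling with a simplified SMOTE-style interpolation. It is an illustration only: the function names, the toy two-feature points, and the choice of a uniformly random partner (rather than one of the k nearest neighbours used by SMOTE [14]) are assumptions made for brevity.

```python
import random

def random_oversample(minority, n_new, rng):
    """Duplicate existing minority samples (no new information is created)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def interpolate_oversample(minority, n_new, rng):
    """Simplified SMOTE-style step: place each new point on the segment
    between two minority samples. Real SMOTE picks the partner among the
    k nearest neighbours; a random partner is an assumption of this sketch."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

rng = random.Random(0)
fraud = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5]]  # toy minority-class points
duplicates = random_oversample(fraud, 4, rng)
synthetic = interpolate_oversample(fraud, 4, rng)
```

Every duplicated point coincides with an existing fraud sample, whereas each interpolated point lies between two of them, which is why interpolation adds variety that plain duplication cannot.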
Generative modeling has recently garnered significant attention due to its effectiveness
in handling diverse types of data and simulating sample behaviors [17]. Its applicability
extends to various domains, including image generation and noise reduction [18]. This
paper aims to leverage the capabilities of generative modeling to address the challenge
of imbalanced credit card fraud detection. Specifically, we propose several models: an
Autoencoder, a Variational Autoencoder (VAE), a Generative Adversarial Network (GAN),
and a hybrid architecture that combines a GAN with an Autoencoder. These techniques
are used for data augmentation by exploiting their ability to learn the distribution of real data and generate synthetic samples that mimic it. The
choice of these models is justified by their proven effectiveness in generating synthetic
images and tabular datasets across various fields.
To evaluate the proposed solutions, we conducted extensive experiments using a real-
world credit card dataset. We utilized various standard evaluation metrics and introduced
a new metric, the Balanced Fraud Detection Score (BFDS), which combines these metrics
for more accurate results and to identify the best-performing methods. Our contributions
can be summarized as follows:
• Proposal of Machine Learning and Deep Learning Models: Several advanced machine
learning and deep learning models are proposed for detecting fraudulent transactions.
• Generative Models for Handling Imbalanced Learning: To address the issue of class
imbalance, we propose multiple generative models to create synthetic fraudulent sam-
ples based on historical datasets, including Autoencoders, Variational Autoencoders
(VAEs), Generative Adversarial Networks (GANs), and a hybrid model combining
GANs with Autoencoders. These models aim to balance the dataset and improve the
detection of rare fraudulent transactions.
• Introduction of a New Evaluation Metric: We introduce a novel metric called the Balanced
Fraud Detection Score (BFDS) that combines accuracy, precision, sensitivity, G-mean, and
specificity to provide a comprehensive assessment of model performance.
• Empirical Validation and Comparison: Extensive experiments are conducted using
a real-world credit card dataset. The results demonstrate the effectiveness of our
generative modeling solutions in classifying transactions and highlight their superior
performance compared to traditional methods like SMOTE and ADASYN based on the
BFDS metric.
These efforts aim to advance the field of fraud detection by providing innovative
solutions to class imbalance and enhancing the performance of detection systems.
Our work is organized as follows: Section 2 reviews the related work, Section 3 pro-
vides background information on the proposed model, Section 4 discusses the methodology
and the materials used, Section 5 presents the experimental evaluation of our approach,
and Section 6 concludes the paper and outlines our future research plans.
2. Related Work
In the literature, numerous solutions have been proposed for maximizing the detection
of fraudulent transactions using a variety of approaches centered on machine learning (ML)
and deep learning (DL) models. To enhance these models, several strategies have been
developed, including the use of statistical processes, mathematical theories, and optimiza-
tion techniques such as metaheuristic algorithms [19,20] and Bayesian optimization [21].
Additionally, various methods have been proposed to handle imbalanced learning. In the
rest of this section, we provide a critical review of some significant works that aim to detect
fraudulent transactions effectively and with higher accuracy.
In a recent study on imbalanced classification [22], a novel approach called the
clustering-based noisy-sample-removed undersampling scheme (NUS) is introduced to
address the challenges faced in applications like credit card fraud detection (CCFD) and
defective part identification. The study highlights the difficulties classifiers encounter due
to noisy samples in both majority and minority classes. The NUS technique begins by
clustering majority-class samples and then utilizes the Euclidean distance from cluster
centers to define hyperspheres, identifying and excluding noisy samples. This method is
applied to both majority and minority classes to enhance the classifier’s performance. The
effectiveness of NUS is validated by integrating it with basic classifiers such as Random
Forest (RF), Decision Tree (DT), and Logistic Regression (LR) and comparing it with seven
other undersampling, oversampling, and noisy-sample-removed methods. The experi-
ments, conducted on 13 public datasets and three real e-commerce transaction datasets,
demonstrate that NUS significantly improves the performance of existing classifiers. In
another paper [23], the researchers highlight the significant impact of fraud on businesses
and individuals globally, where millions of US dollars are lost annually. With the surge in
online transactions, credit cards have become a prevalent payment method, but they have
also increased opportunities for fraudulent activities. Furthermore, the paper addresses the
critical issue of data imbalance in machine learning models used for fraud detection, as
fraudulent transactions constitute only a small percentage of the total data. This imbalance
can severely hinder the performance of classifiers. To tackle this, the study explores various
data augmentation techniques and introduces a novel model called K-means Convolutional
Generative Adversarial Network (K-CGAN), which is specifically designed for credit card
fraud detection. Additionally, they evaluate the effectiveness of different augmentation
techniques, including B-SMOTE, K-CGAN, and SMOTE, using major classification tech-
niques. The findings indicate that K-CGAN achieves the highest precision, recall, F1 score,
and accuracy, outperforming other methods and significantly enhancing the detection of
fraudulent transactions.
The authors of [24] focused on the importance of accurately classifying fraudulent transactions
to protect customers. Using machine learning methodologies, the study tested
various models, finding XGBoost to perform well with a precision score of 0.91 and an
accuracy score of 0.99. To address the dataset’s imbalance, several sampling techniques
were applied, with Random Oversampling emerging as the most effective, achieving a
precision and accuracy score of 0.99 with XGBoost. The study emphasizes the significance
of data-balancing methods in improving the performance of fraud detection models. Separately,
Ibomoiye et al. [25] tackle the challenges of credit card fraud detection by addressing
the issues posed by dynamic shopping patterns and class imbalance. They propose a robust
deep learning approach, utilizing Long Short-Term Memory (LSTM) and Gated Recurrent
Unit (GRU) neural networks as base learners in a stacking ensemble framework, with a
Multilayer Perceptron (MLP) serving as the meta-learner. To manage the class imbalance
problem, the study employs the SMOTE-ENN method. As a result, they achieve a sen-
sitivity of 1.000 and a specificity of 0.997, outperforming other commonly used machine
learning classifiers and methods. This research underscores the potential of combining
advanced deep learning techniques with data balancing strategies to improve credit card
fraud detection systems. In addition, ref. [26] proposes a two-stage framework that uses
a deep Autoencoder for representation learning, followed by supervised deep learning
techniques for fraud detection. This approach significantly improves the performance of
deep learning classifiers compared to those trained on original data and other methods like
PCA. The findings highlight the effectiveness of this advanced method in enhancing fraud
detection systems.
Likewise, the authors in [27] proposed a framework called HNN-CUHIT that combines
a hybrid neural network with a clustering-based undersampling technique, leveraging
identity and transaction features. They evaluated their solution on a real dataset from a
city bank during the SARS-CoV-2 pandemic in 2020. As a result, the proposed solution
outperforms traditional models such as Logistic Regression, Random Forest, and CNN, par-
ticularly in handling imbalanced class distributions by achieving the best F1 score in fraud
detection, highlighting its superior performance in identifying fraudulent transactions.
This innovative approach offers a valuable contribution to improving fraud detection in the
financial sector. Furthermore, the study [28] applies federated learning, using frameworks such
as TensorFlow Federated and PyTorch, to enhance detection across
banks without sharing sensitive data. The authors compare individual and hybrid resampling
techniques and report that Random Forest classifiers outperform other models, achieving
the best performance metrics. The PyTorch framework yields higher prediction accuracy
for federated learning models, though with increased computational time, highlighting its
effectiveness in handling skewed datasets. In addition, the study [29] tackles the challenge
of acquiring labeled datasets, particularly in highly class-imbalanced domains like credit
card fraud detection. It introduces a novel methodology using Autoencoders to synthesize
class labels for such data. This approach minimizes the need for expert input by leverag-
ing an error metric from the Autoencoder to create new binary class labels. These labels
are then used to train supervised classifiers for fraud detection. Conducted experiments
demonstrate that the synthesized labels are of high quality, significantly improving clas-
sifier performance as measured by the area under the precision–recall curve. The study
also shows that increasing the proportion of positive-labeled instances enhances classifier
performance, effectively addressing class imbalance concerns. In [30], the authors focus
on developing a real-time fraud detection framework that can adapt to the constantly
changing fraud characteristics and handle the class imbalance and complete separation issues
inherent in fraud data. The proposed solution includes a novel approach to managing
non-stationary changes in transaction patterns and a robust fuzzy logistic regression model
to tackle class imbalance and separation problems. This methodology improves model
training efficiency and maintains high specificity and sensitivity, even with small sample
sizes. The framework achieves an accuracy greater than 0.99 in identifying fraudulent and
non-fraudulent transactions, outperforming other machine learning and fraud detection
methods. The enhanced classification performance ensures better precision in detecting
fraudulent transactions, reduces false positives, and minimizes financial losses while increasing
customer satisfaction. In other work, Asma Cherif et al. [31] propose a new solution
based on Graph Neural Networks (GNNs) for credit card fraud detection. They focus on
selecting relevant features and designing a model to capture the relationships between
entities like merchants and customers. Their novel encoder–decoder-based GNN model,
enhanced with a graph converter and batch normalization, showed promising results on a
large-scale dataset, outperforming other models in precision, recall, and F1 score.
In this paper, we aim to improve the detection of fraudulent transactions by address-
ing the imbalance issue through advanced generative modeling techniques. Unlike tradi-
tional methods, which often struggle with the sparse and imbalanced nature of fraudulent
transaction data, our approach utilizes Variational Autoencoders (VAEs), Autoencoders,
Generative Adversarial Networks (GANs), and a hybrid GAN-Autoencoder model. These
models are adept at generating synthetic fraudulent samples, thereby enriching the dataset
and enhancing the model’s ability to detect fraud. The efficacy of our approach is under-
scored by its demonstrated success in generating realistic synthetic data, as evidenced
by its performance in related fields such as image and text generation. This innovative
use of deep learning architectures ensures a more robust and accurate detection system,
which is capable of adapting to the evolving patterns of fraudulent behavior. However, our
approach has certain limitations. First, the quality of synthetic samples heavily depends on
the proper tuning of hyperparameters, which can be computationally intensive. Second, the
generated synthetic data may not fully capture rare or highly complex fraudulent patterns,
potentially limiting the model’s generalization to unseen cases. Table 1 provides a detailed
description of the cited works.
3. Generative Models
Generative models are a type of deep learning architecture used to capture the under-
lying structure of data and generate synthetic data by simulating the distribution of the
real data. Initially popularized for image generation due to their remarkable results, these
models have garnered significant interest from researchers exploring new applications,
such as dimensionality reduction and feature selection [32]. In this paper, we leverage
the capabilities of generative models to address the issue of data imbalance in our dataset.
Generally, this approach works as follows: given a training set $X_{\text{train}}$ and a set of parameters
$\theta$, a model can be constructed to estimate the probability distribution of the data.
The likelihood is the probability that the model assigns to the training data for a dataset
containing $m$ samples $x^{(i)}$:
$$\prod_{i=1}^{m} p_{\text{model}}\left(x^{(i)}; \theta\right) \tag{1}$$
The maximum likelihood method provides a way to compute the parameters θ that can
maximize the likelihood of the training data [33]. To simplify the optimization, we take the
logarithm of the likelihood in Equation (1) to express the probabilities as a sum rather than
a product:
$$\theta^{*} = \arg\max_{\theta} \sum_{i=1}^{m} \log p_{\text{model}}\left(x^{(i)}; \theta\right) \tag{2}$$
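As a small worked illustration of Equation (2), the sketch below evaluates the log-likelihood of a toy one-dimensional sample under a Gaussian model with unknown mean and fixed unit variance, and checks on a coarse grid that the sample mean maximizes it. The Gaussian model, the data values, and the grid search are assumptions chosen for illustration; the models in this paper are neural networks trained with gradient-based optimizers.

```python
import math

def log_likelihood(data, theta, sigma=1.0):
    """Sum of log p_model(x_i; theta) for a Gaussian with mean theta."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (x - theta) ** 2 / (2.0 * sigma ** 2) for x in data)

data = [1.2, 0.8, 1.0, 1.4, 0.6]           # toy training set
grid = [i / 100.0 for i in range(0, 201)]  # candidate values of theta in [0, 2]
theta_star = max(grid, key=lambda t: log_likelihood(data, t))
sample_mean = sum(data) / len(data)
```

Working with the sum of logarithms rather than the product in Equation (1) also avoids numerical underflow when the number of samples is large.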
If the true data distribution $p_{\text{data}}$ lies within the family of distributions represented by
$p_{\text{model}}(x; \theta)$, the model can accurately approximate $p_{\text{data}}$. However, in practice, the true
distribution is not accessible, and only the training data are available for modeling [34].
Thus, the models must define their density function and find the $p_{\text{model}}(x; \theta)$ that maximizes
the likelihood. Generative models produce synthetic data by learning the probability
distribution of the observed data and generating new samples from this learned distribution.
The process typically involves two key components:
1. Latent Variables: These are unobserved variables that capture the underlying factors
of variation in the data. Let $z$ represent a vector of latent variables, which are typically
sampled from a simple prior distribution $p(z)$. This prior is often chosen to be a
standard normal distribution, i.e., $z \sim \mathcal{N}(0, I)$.
2. Generative Function: This function, parameterized by $\theta$, maps the latent variables $z$
to the data space. The generative process can be expressed as $x = G(z; \theta)$, where $G$ is
a neural network or another function that transforms the latent space into the data
space, generating synthetic data samples $x$.
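The two components can be sketched in a few lines of Python. Here the generative function G is a fixed affine map with hand-picked weights, which is purely an assumption for illustration; in practice G is a trained neural network.

```python
import random

def sample_latent(dim, rng):
    """Draw z ~ N(0, I): each coordinate is an independent standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def generate(z, weights, bias):
    """Toy generative function x = G(z; theta) = W z + b, with theta = (W, b)."""
    return [sum(w * zi for w, zi in zip(row, z)) + b_k
            for row, b_k in zip(weights, bias)]

rng = random.Random(42)
# Hypothetical parameters mapping a 2-D latent space to a 3-D data space.
W = [[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]]
b = [0.1, 0.2, 0.3]
samples = [generate(sample_latent(2, rng), W, b) for _ in range(5)]
```

Each call draws a fresh latent vector, so repeated calls yield different synthetic samples from the same parameters.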
The objective of training a generative model is to approximate the true data distribution
$p_{\text{data}}(x)$ by learning the model distribution $p_{\text{model}}(x)$. This involves optimizing the model
parameters $\theta$ such that the synthetic data distribution $p_{\text{model}}(x)$ closely matches the real
data distribution. Generative models are powerful tools for data generation, but
their application to imbalanced data problems comes with inherent challenges, particularly
mode collapse and instability during training. Mode collapse occurs when the generator
learns to produce a limited set of outputs, failing to capture the full diversity of the target
distribution, which can undermine the quality of the synthetic data generated for the
minority class. Additionally, the adversarial nature of models such as GANs can lead to training
instability, where the generator and discriminator fail to converge, resulting in poor-quality
synthetic samples. In the rest of this section, we describe the proposed models for handling
the imbalance issue.
The reconstruction loss measures the difference between the input and the output,
serving as an objective function to be minimized during training. This loss, often calculated
using mean squared error or binary cross-entropy, ensures that the output $\tilde{X}_i$ closely
resembles the original input $X_i$. The training objective can be formulated as
where the loss function measures the dissimilarity between the input $X_i$ and the reconstructed
output $\tilde{X}_i$. In this paper, we used an Autoencoder architecture to balance our
credit card dataset. The architecture and the algorithm employed are described in detail in
Algorithm 1.
In this study, the hyperparameters for this Autoencoder were carefully chosen to
balance model complexity, prevent overfitting, and improve the model’s ability to capture
meaningful features of fraudulent samples. The model uses an input/output layer with
30 dimensions corresponding to the dataset’s features. The encoder reduces dimensionality
through three hidden layers (64, 32, and 8 units) with ReLU activation, addressing vanishing
gradient issues and improving convergence. Batch normalization stabilizes training, while
dropout rates of 0.2 and 0.3 prevent overfitting. The symmetric decoder structure aids
in effective data regeneration, and the final sigmoid output layer is suitable for binary
classification. The Adam optimizer with a learning rate of 0.001 ensures efficient training,
and the model is trained for 100 epochs with a batch size of 32 to balance convergence
speed and computational cost. Shuffling the data during training enhances generalization
by reducing the impact of data ordering.
$$p(x, z) = p(x \mid z)\, p(z) \tag{4}$$
The model’s inference is examined by computing the posterior of the latent vector using
Bayes’ theorem, as shown in the equation below:
$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} \tag{5}$$
Variational inference approximates this posterior with a tractable family of distributions, such as Gaussians.
The reliability of this approximation can be assessed through the Kullback–Leibler
divergence, which measures the information loss during approximation. The architecture
and algorithm used for the VAE implementation in our study are detailed in Algorithm 2.
This algorithm outlines the training process for a VAE on fraudulent transaction samples,
with carefully chosen hyperparameters to balance model complexity, prevent overfitting, and
enhance the model’s ability to capture meaningful representations. The encoder architecture,
with hidden layers of 64 and 32 units, reduces the data dimensionality, capturing complex
patterns without overfitting. ReLU activation is used to mitigate the vanishing gradient prob-
lem and accelerate convergence, while batch normalization stabilizes training and improves
generalization. Dropout rates of 0.2 and 0.3 are applied to prevent overfitting by randomly
deactivating neurons during training. The encoder outputs the mean ($\mu$) and log-variance
($\log \sigma^2$) for stochastic sampling, allowing the model to effectively learn from the data. The
decoder mirrors the encoder structure for symmetric data reconstruction, with a sigmoid out-
put layer suitable for binary classification. The loss function combines binary cross-entropy for
reconstruction and KL divergence for regularization, promoting both accurate reconstruction
and a structured latent space. The model is trained using the Adam optimizer with a learning
rate of 0.001 for efficient training, for 100 epochs with a batch size of 32 to balance convergence
speed and computational cost.
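The stochastic sampling step and the loss described above can be written out for a single sample as follows. This is a plain-Python sketch, not the authors' implementation: the function names and numeric values are hypothetical, and it assumes a standard normal prior, for which the KL term has the closed form used below.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Reconstruction term, summed over features; x and x_hat lie in [0, 1]."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

rng = random.Random(0)
mu, log_var = [0.3, -0.1], [-0.5, 0.2]        # hypothetical encoder outputs
z = reparameterize(mu, log_var, rng)
x, x_hat = [0.0, 1.0, 0.5], [0.1, 0.8, 0.5]   # hypothetical input and reconstruction
vae_loss = binary_cross_entropy(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

Predicting the log-variance rather than the variance keeps the standard deviation positive without any constraint on the encoder output, which is one common reason for this parameterization.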
2. Discriminator: The discriminator network receives both real data and the data gen-
erated by the generator. It classifies these inputs as real or fake using a sigmoid
activation function and binary cross-entropy loss. The discriminator is trained to
distinguish between the real and generated data, providing feedback to the generator
on how well it is performing [42].
The generator and discriminator are trained together in a competitive process known
as a minimax game, where the generator tries to maximize the probability that the dis-
criminator mistakes fake data for real data, while the discriminator tries to minimize this
probability. This can be expressed mathematically as
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{6}$$
where $\mathbb{E}$ denotes the expected value, $p_{\text{data}}(x)$ represents the distribution of real data, and
$p_z(z)$ represents the distribution of the noise input to the generator. During training, the
generator and discriminator engage in a dynamic process where the generator attempts to
improve its ability to produce realistic data while the discriminator continually refines its
ability to distinguish real data from fake data. This iterative process continues until the
discriminator can no longer reliably differentiate between real and fake data, indicating
that the generator has succeeded in producing highly realistic data. The feedback loop
provided by the discriminator is crucial for the generator’s learning process. After each
batch of training, backpropagation is used to update the weights of both the generator and
discriminator networks, optimizing their performance. Algorithm 3 shows our proposed
architecture of the GAN model used to address the imbalance issue. The choice of hyperpa-
rameters in the GAN training algorithm is made to optimize both model performance and
stability. The generator’s architecture uses progressively smaller layers (128, 64, 50, 40, and
15 units) to effectively map a high-dimensional latent space to the target data distribution.
The larger initial layers (128 and 64 units) capture more complex features, while the smaller
layers reduce the dimensionality to match the output data. ReLU activation functions are
employed throughout the generator to mitigate the vanishing gradient problem and speed
up convergence. Dropout is set to 0.5 to regularize the model and prevent overfitting by
randomly deactivating half of the units during training. Batch normalization is applied to
stabilize training by normalizing layer inputs, ensuring more consistent gradients. In the
discriminator, the choice of 128, 64, and 32 units, along with LeakyReLU activations, allows
the model to effectively distinguish between real and fake data while mitigating the risk of
dying neurons. Dropout is similarly set to 0.5 in the discriminator to avoid overfitting, and
the use of binary cross-entropy loss ensures the proper evaluation of fake and real data. The
Adam optimizer with a learning rate of 0.001 is selected for both models to ensure efficient
training and prevent the instability often seen with other optimizers in GAN training.
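To make Equation (6) concrete, the sketch below estimates V(D, G) from a batch of discriminator outputs and derives the two losses implied by the minimax game. The discriminator outputs are made-up numbers, and the function names are hypothetical; this is an illustration of the objective, not of the training loop in Algorithm 3.

```python
import math

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def discriminator_loss(d_real, d_fake):
    """The discriminator maximizes V, i.e. minimizes -V (binary cross-entropy)."""
    return -value_function(d_real, d_fake)

def generator_loss(d_fake):
    """The generator minimizes the only term of V that depends on G."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# Hypothetical outputs D(x) on real samples and D(G(z)) on generated samples.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.3, 0.2]
v_sharp = value_function(d_real, d_fake)

# A discriminator that outputs 0.5 everywhere cannot tell the two apart.
v_confused = value_function([0.5] * 3, [0.5] * 3)
```

When the discriminator outputs 0.5 for every input, the estimate equals -2 log 2, the value associated with a discriminator that can no longer separate real from generated data.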
3.4. AE-GAN
AE-GAN is a hybrid approach that combines an Autoencoder and a Generative
Adversarial Network to effectively address the imbalance issue in credit card datasets. This
combination leverages the strengths of both models to improve the quality and diversity of
synthetic data, which is crucial for training robust fraud detection systems, as shown in
Figure 5. The process is outlined as follows:
1. Extract fraudulent samples from the training set.
2. Pass these samples through an Autoencoder (AE) to encode the data into a lower-
dimensional space.
3. Apply Principal Component Analysis (PCA) to reduce the dimensionality of the
encoded data to 15 features.
4. Train a Generative Adversarial Network (GAN) using the PCA-reduced data. The
GAN consists of a generator and a discriminator.
5. Use the generator to produce synthetic features based on the reduced data.
6. Pass these generated features through the decoder of the Autoencoder to reconstruct
the synthetic data.
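The six steps can be arranged as a small orchestration function. The sketch below is only a skeleton under explicit assumptions: `encode`, `reduce_dim`, `train_gan`, and `decode` are hypothetical callables standing in for the trained Autoencoder encoder, the PCA projection, the GAN training routine, and the Autoencoder decoder, and the toy stand-ins at the bottom exist solely so the example runs.

```python
def ae_gan_augment(train_x, train_y, n_synthetic,
                   encode, reduce_dim, train_gan, decode):
    """Follow steps 1-6: fraud samples -> AE encoding -> PCA -> GAN -> decoder."""
    # Step 1: keep only the fraudulent samples (label 1 is an assumption).
    fraud = [x for x, y in zip(train_x, train_y) if y == 1]
    # Step 2: encode into the lower-dimensional Autoencoder space.
    encoded = [encode(x) for x in fraud]
    # Step 3: reduce the encoded data (PCA in the text).
    reduced = reduce_dim(encoded)
    # Step 4: train the GAN; it returns a sampler wrapping the generator.
    sample_generator = train_gan(reduced)
    # Step 5: draw synthetic feature vectors from the generator.
    generated = [sample_generator() for _ in range(n_synthetic)]
    # Step 6: map the generated features back through the decoder.
    return [decode(g) for g in generated]

# Toy stand-ins (not real models) so the skeleton can be executed end to end.
def toy_encode(x):
    return [v * 0.5 for v in x]

def toy_reduce(rows):
    return [r[:2] for r in rows]

def toy_train_gan(reduced):
    mean = [sum(col) / len(col) for col in zip(*reduced)]
    return lambda: list(mean)  # a "generator" that always emits the mean

def toy_decode(g):
    return [v * 2.0 for v in g] + [0.0]

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
y = [1, 0, 1]
synthetic = ae_gan_augment(X, y, 4, toy_encode, toy_reduce, toy_train_gan, toy_decode)
```

Passing the stages in as callables keeps the ordering of the steps explicit while leaving the choice of Autoencoder, reduction method, and GAN implementation open.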
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{9}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{10}$$
• F-measure: The F-measure, or F1 score, is the harmonic mean of precision and recall.
It provides a single score that balances the importance of both metrics:
$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12}$$
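These definitions translate directly into code. The helper below computes them from the four confusion-matrix counts; the function name and the example counts are hypothetical, and the G-mean is taken as the geometric mean of sensitivity and specificity, which is its usual definition. The second call describes a classifier that labels every transaction as legitimate.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0   # recall, Equation (9)
    specificity = tn / (tn + fp) if tn + fp else 0.0   # Equation (10)
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = precision + sensitivity
    f_measure = 2.0 * precision * sensitivity / denom if denom else 0.0  # Equation (12)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }

# Hypothetical counts for 10,000 transactions, 100 of which are fraudulent.
good = classification_metrics(tp=80, fp=5, tn=9895, fn=20)
naive = classification_metrics(tp=0, fp=0, tn=9900, fn=100)
```

With these counts the naive classifier reaches an accuracy of 0.99 while its sensitivity, F-measure, and G-mean are all zero, whereas the first classifier has a sensitivity of 0.8.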
Traditional metrics like accuracy can be misleading due to the imbalance in the dataset.
For instance, a model that predicts all transactions as legitimate could still achieve high
accuracy due to the overwhelming number of legitimate transactions. To address this, we
also consider the G-mean and F-measure, which are better suited for evaluating models on
imbalanced datasets. G-mean ensures that the classifier performs well on both classes, while
F-measure balances the trade-off between precision and sensitivity. These metrics, along
with the analysis of the confusion matrix, provide a comprehensive view of the classifier’s
performance and its effectiveness in distinguishing between fraudulent and legitimate
behaviors. This approach allows us to better understand the strengths and weaknesses of
the classifiers and the impact of class imbalance on the classification results. In addition, for
an accurate comparison, we created a new metric based on Table 3, which summarizes all
the proposed metrics in this paper. This new metric, called the Balanced Fraud Detection
The coefficients in the BFDS formula are designed to reflect the importance of each evalua-
tion metric in fraud detection. Metrics like recall, precision, and F-measure are assigned
higher weights due to their role in identifying fraud while minimizing errors. Recall has
the highest weight, as detecting fraudulent transactions is crucial, while precision and F-
measure balance false positives and overall model performance. All weights are divided by
60 to ensure that the gap between all metrics is reduced, providing a more comprehensive
and accurate evaluation of performance in imbalanced datasets.
Figure 14. BFDS comparison for each model across different oversampling techniques.
Table 4 presents the performance metrics of various machine learning models using
Generative Adversarial Networks for data augmentation. Among the evaluated models,
XGB achieved the highest sensitivity (0.830882) and demonstrated strong performance
across other metrics, including precision (0.941667), F-measure (0.882813), and G-mean
(0.911490). RF also performed competitively with an F-measure of 0.877470 and a G-mean
of 0.903393, indicating a balanced trade-off between sensitivity and specificity. LR and
DT showed moderate sensitivity (0.669118 and 0.823529, respectively), with DT having a
higher G-mean (0.907209) compared to LR (0.817924). Notably, the NB model performed
poorly, with a sensitivity of 0.000000 and corresponding F-measure and G-mean values of
0.000000, suggesting its ineffectiveness in the given context. Additionally, advanced neural
network models such as LSTM and ANN demonstrated robust performance, with LSTM
matching the performance of BC in all metrics. Overall, tree-based ensemble methods (RF,
XGB, and LGBM) consistently outperformed other models, reflecting their ability to capture
complex data patterns effectively.
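The relationships between the metrics discussed above can be sketched with their standard confusion-matrix definitions. The helper names below are illustrative and are not taken from the paper's code. The example reuses the precision and sensitivity quoted for XGB and shows why a sensitivity of zero, as observed for NB, forces both the F-measure and the G-mean to zero.

```python
import math


def sensitivity(tp, fn):
    """Recall on the fraud class: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Recall on the legitimate class: TN / (TN + FP)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def precision(tp, fp):
    """Share of predicted frauds that are real frauds: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(prec, sens):
    """Harmonic mean of precision and sensitivity (0 when both are 0)."""
    return 2 * prec * sens / (prec + sens) if (prec + sens) else 0.0


def g_mean(sens, spec):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sens * spec)


# Precision and sensitivity quoted in the text for XGB with GAN augmentation
print(round(f_measure(0.941667, 0.830882), 4))  # 0.8828, in line with the reported F-measure

# A classifier that never flags fraud (sensitivity 0) gets F-measure 0 and G-mean 0
print(f_measure(0.0, 0.0), g_mean(0.0, 1.0))
```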
The RNN model showed lower performance across all metrics, particularly in sensitivity and
precision, indicating its limited effectiveness.
Table 6 presents the performance metrics of various machine learning models using
Autoencoder (AE) for feature extraction. XGB achieved the best overall performance,
with the highest sensitivity (0.816176), precision (0.973684), and F-measure (0.888000),
along with a G-mean of 0.903409, indicating its effectiveness in maintaining a balance
between sensitivity and specificity. LGBM also performed competitively, with a sensitivity
of 0.801471 and an F-measure of 0.868526, demonstrating its robustness in handling the
AE-transformed data. RF followed closely with a sensitivity of 0.786765 and an F-measure
of 0.856000, further confirming the strength of ensemble-based approaches. Conversely, the
NB model again performed poorly, yielding sensitivity, precision, and F-measure values
of 0.000000, making it unsuitable for this dataset. Neural network models displayed
mixed results, with LSTM achieving better performance (F-measure of 0.809160) than ANN
(0.718182), while the RNN model exhibited the weakest performance across all metrics
(sensitivity of 0.029412 and F-measure of 0.000765), indicating challenges in learning from
AE-transformed data. DT and Bagging Classifier (BC) showed moderate performance,
with G-means of 0.878411 and 0.870199, respectively. Overall, tree-based ensemble models,
especially XGB and LGBM, outperformed other models, highlighting their superior ability
to extract meaningful patterns from AE-enhanced datasets.
Table 8 presents the performance metrics of various machine learning models using the
Synthetic Minority Oversampling Technique (SMOTE) to address class imbalance. Among
the models, Random Forest (RF) achieved the best overall performance, with a sensitivity
of 0.867647, a precision of 0.855072, and an F-measure of 0.861314, highlighting its strong
ability to detect minority class samples while maintaining high accuracy. XGB and ANN
models also performed competitively, with XGB achieving a sensitivity of 0.860294 and an
F-measure of 0.790541, while ANN recorded a sensitivity of 0.875000 and an F-measure
of 0.777778, showcasing their robustness in learning from the oversampled data. LGBM
achieved a similar sensitivity (0.867647) but had a lower precision (0.504274), resulting in
a lower F-measure (0.637838), indicating a trade-off between detecting positive samples
and minimizing false positives. In contrast, simpler models like LR and NB struggled
with low precision (0.052764 and 0.055274, respectively) and F-measures (0.099842 and
0.104031, respectively), despite having relatively high sensitivity (0.926471 for LR and
0.882353 for NB), reflecting their difficulty in handling the increased complexity of the
SMOTE data. DT and BC displayed moderate performance, with sensitivities of 0.750000
and 0.808824, respectively, and F-measures of 0.481132 and 0.698413. Notably, GB and
AB underperformed in precision (0.109075 and 0.053700) and F-measure (0.195008 and
0.101559), reflecting their challenges in balancing false positives and false negatives. Overall,
RF, XGB, and ANN outperformed the other approaches, demonstrating their effectiveness in
handling class imbalance when combined with SMOTE. LR, NB, and the boosting methods GB and AB
exhibited much lower precision and F-measure, making them less suitable for datasets with
imbalanced classes.
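For readers unfamiliar with SMOTE, its core idea can be sketched as follows: each synthetic sample is placed at a random position on the segment between a minority-class sample and one of its nearest minority-class neighbors. This is a simplified, standard-library sketch for illustration only; the function name `smote_sketch` and its parameters are assumptions, and it is not the implementation used to produce the results above.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy minority class: the four corners of the unit square
fraud = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(fraud, n_new=10, k=2)
print(len(new_points))  # 10 synthetic points, all inside the unit square
```

Because every synthetic point lies between two existing minority samples, SMOTE can place points in regions that overlap the majority class, which is one plausible reason for the low precision observed for some classifiers.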
The bar plots in Figure 11, representing the model performance metrics with SMOTE,
showcase the accuracy, specificity, sensitivity, precision, F-measure, and G-mean of different
classifiers. Random Forest (RF) stands out with consistently high values across all metrics,
particularly in accuracy and specificity, emphasizing its effectiveness in detecting both
fraudulent and non-fraudulent transactions. XGB and ANN also display solid perfor-
mances, particularly in accuracy, specificity, and sensitivity, making them reliable choices
for fraud detection. On the other hand, models like Naive Bayes (NB), AdaBoost (AB), and
Logistic Regression (LR) show significant discrepancies, with low precision and F-measure,
indicating challenges in identifying fraudulent transactions accurately. K-Nearest
Neighbors (KNN) and Gradient Boosting (GB) reach good sensitivity, though their
precision and F-measure could be improved. Overall, the plot
indicates that RF, XGB, and ANN are the top performers, while models like Naive Bayes
and AdaBoost need further optimization for better fraud detection.
Table 9 presents the performance metrics of various machine learning models us-
ing the Adaptive Synthetic (ADASYN) sampling technique to address class imbalance.
Among the models, Random Forest (RF) achieved the highest overall performance, with a
sensitivity of 0.845588, a precision of 0.864662, and an F-measure of 0.855019, indicating
a strong ability to accurately classify both the majority and minority classes. XGB also
performed well, with a sensitivity of 0.889706 and an F-measure of 0.793443, reflecting
its effectiveness in handling the ADASYN-augmented dataset. LGBM followed closely,
achieving a sensitivity of 0.904412 and a G-mean of 0.950063, though its lower precision
(0.421233) resulted in a lower F-measure (0.574766). Neural network models exhibited
competitive performance, with the Artificial Neural Network (ANN) achieving a sensitivity
of 0.875000 and an F-measure of 0.772727, while the LSTM model showed a sensitivity
of 0.882353 and an F-measure of 0.603015. In contrast, RNN performed less effectively,
with a lower sensitivity (0.838235) and an F-measure (0.173780), indicating its struggles in
capturing the patterns of the ADASYN-enhanced data. Simpler models such as Logistic
Regression (LR) and Naive Bayes (NB) underperformed despite having high sensitivity
(0.955882 for LR and 0.911765 for NB), with low precision (0.016447 and 0.035048, re-
spectively) and corresponding low F-measures (0.032338 and 0.067501). GB and AB also
showed weak performance in precision (0.044720 and 0.025422) and F-measure (0.085442
and 0.049476), highlighting their difficulty in effectively handling the oversampled data.
Overall, RF, XGB, and ANN demonstrated the best
performance under ADASYN, achieving a strong balance between sensitivity and precision.
In contrast, LR, NB, and the boosting algorithms GB and AB struggled to maintain
high precision, limiting their effectiveness in this context.
The bar plots in Figure 12 highlight the performance of various classifiers with ADASYN
in fraud detection. RF is the top performer, excelling in accuracy, specificity, sensitivity,
and F-measure. XGB and ANN also show strong results, particularly in sensitivity and
F-measure. LR struggles with precision and F-measure, while GB and AB underperform in
these metrics. NB has moderate performance but is weaker in fraud detection. KNN, LGBM,
and RNN show consistent sensitivity but need improvement in precision and F-measure.
Overall, RF and XGB lead in performance, while LR and AB require further optimization.
To assess the effectiveness of the proposed methods, we employed a Wilcoxon Rank-Sum
test at a 95% confidence level, so that p-values below 0.05 are treated as statistically
significant. This non-parametric test is used to determine whether
there are significant differences between two independent samples. After resampling the
dataset, the resampled datasets were used to train six different classifiers. The classifiers’
performances were evaluated using various metrics. To compare the resampling techniques,
the average performance metrics were calculated. The null hypothesis H0 and alternative
hypothesis H1 for the Wilcoxon Rank-Sum test in this case can be formulated as follows:
• Null Hypothesis H0 : There is no significant difference between the performance
metrics of the two oversampling methods when applied to the resampled datasets.
• Alternative Hypothesis H1 : There is a significant difference between the performance
metrics of the two oversampling methods when applied to the resampled datasets.
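The decision rule implied by these hypotheses can be sketched as follows. This is a minimal standard-library version of the two-sided Wilcoxon rank-sum test based on the normal approximation, without tie or continuity correction; the function names and the example scores are illustrative assumptions, and the paper does not state which implementation was used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2             # expected rank sum under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p-value
    return z, p


# Made-up metric values for two resampling methods (illustration only)
method_a = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]
method_b = [0.70, 0.74, 0.69, 0.72, 0.71, 0.73]
z, p = rank_sum_test(method_a, method_b)
print(p < 0.05)  # True: H0 is rejected for these made-up scores
```

A p-value below 0.05 leads to rejecting H0, which indicates that the two sets of scores differ; since H1 is two-sided, the test alone does not say which method scores higher. The normal approximation is only rough for small samples.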
The results of the statistical significance tests are presented in Tables 10–13, which
show the p-values obtained from the Wilcoxon test for pairwise comparisons of the
resampling techniques in terms of sensitivity, precision, F-measure, and G-mean,
respectively.
Table 10 presents the p-values from the Wilcoxon Rank-Sum test for sensitivity
comparisons among the oversampling techniques. At the 0.05 threshold, VAE does not differ
significantly from GAN, AE-GAN, or AE, with p-values of 0.6848, 0.9593, and 0.2549,
respectively, so these generative methods yield statistically similar sensitivity.
ADASYN, in contrast, differs significantly from GAN, AE-GAN, and AE, with p-values of
0.0012, 0.0022, and 0.0002, respectively, and the difference between VAE and ADASYN is
also statistically significant, as indicated by a p-value of 0.0022. SMOTE likewise
differs significantly from GAN and AE, with p-values of 0.0034 and 0.0004, respectively,
and from VAE (0.0017), but not from ADASYN (0.1542). Overall, these findings separate
the techniques into two groups for sensitivity: SMOTE and ADASYN, which are statistically
indistinguishable from each other, and the generative methods, from which ADASYN differs
significantly in every comparison. This is consistent with the generally higher
sensitivities reported under SMOTE and ADASYN in Tables 8 and 9.
Table 10. Wilcoxon test p-values for sensitivity comparison between different oversampling techniques.
Table 11 presents the p-values obtained from the Wilcoxon Rank-Sum test for precision
comparisons among the oversampling techniques. VAE differs significantly from GAN,
AE-GAN, and AE in terms of precision, with p-values of 0.0061, 0.0076, and 0.0229,
respectively. ADASYN also differs significantly from GAN, AE-GAN, and AE, with p-values
of 0.0012, 0.0007, and 0.0017, respectively. The difference between VAE and ADASYN is
significant as well, as indicated by a p-value of 0.0134. For SMOTE, the comparisons with
AE (0.0012) and VAE (0.0134) are significant, whereas the comparisons with GAN and ADASYN
are not (0.4801 in both cases). Because the test is two-sided, these p-values show which
techniques yield different precision, not which one is better; the direction of each
difference has to be read from the per-technique results in Tables 4–9. Overall, VAE and
ADASYN each differ significantly in precision from GAN, AE-GAN, and AE, and from each
other, while SMOTE and ADASYN are statistically indistinguishable in this metric.
Table 11. Wilcoxon test p-values for precision comparison between different oversampling techniques.
Table 12 presents the p-values from the Wilcoxon Rank-Sum test for F-measure
comparisons across the oversampling techniques. VAE differs significantly from GAN and
AE-GAN, with p-values of 0.0327 and 0.0186, respectively, but not from AE (0.0843, above
the 0.05 threshold). ADASYN differs significantly from GAN, AE-GAN, and AE, with p-values
of 0.0061, 0.0080, and 0.0170, respectively. Additionally, AE-GAN differs significantly
from AE, with a p-value of 0.0262. The comparison between VAE and ADASYN does not
indicate a significant difference, as shown by a p-value of 0.0573, suggesting comparable
F-measure under these two techniques. SMOTE does not differ significantly from VAE or
ADASYN, with p-values of 0.0942 and 0.4327, respectively, but it does differ from GAN and
AE-GAN, with a p-value of 0.0061 in both cases. Overall, VAE, ADASYN, and SMOTE form a
group with statistically comparable F-measure that is distinct from GAN and AE-GAN.
Table 12. Wilcoxon test p-values for F-measure comparison between different oversampling techniques.
Table 13 presents the p-values from the Wilcoxon Rank-Sum test for G-mean comparisons
between the oversampling techniques. At the 0.05 threshold, VAE does not differ
significantly from GAN, AE-GAN, or AE, with p-values of 0.8925, 0.9374, and 0.2393,
respectively. ADASYN differs significantly from GAN, AE-GAN, and AE, with p-values of
0.0012, 0.0004, and 0.0002, respectively, and also from VAE, as indicated by a p-value of
0.0002. SMOTE differs significantly from VAE (0.0012) but not from ADASYN (0.9374).
Overall, the G-mean results mirror those for sensitivity: SMOTE and ADASYN are
statistically indistinguishable from each other, ADASYN is clearly separated from GAN,
AE-GAN, AE, and VAE, and VAE is statistically indistinguishable from the other
generative methods.
Table 13. Wilcoxon test p-values for G-mean comparison between different oversampling techniques.
Table 14 presents the BFDS scores for various models across
different oversampling techniques, highlighting their effectiveness in handling class imbal-
ance. Among the oversampling techniques evaluated, AE-GAN consistently provides the
highest BFDS scores for most models, indicating its superior ability to enhance classifier
performance in detecting fraudulent transactions. Specifically, RF, with a BFDS of 0.697,
and XGB, with a BFDS of 0.691, achieve the highest scores, showcasing their robustness
and precision. These models, combined with AE-GAN, demonstrate the best performance,
effectively balancing sensitivity and precision. In comparison, traditional oversampling
techniques like SMOTE and ADASYN perform slightly lower, with RF scoring 0.685 and
0.671, respectively, under these methods. While these techniques are still effective, AE-
GAN’s innovative approach seems to offer a more nuanced enhancement, particularly for
ensemble methods like RF and XGB. Deep learning models, such as ANN, also benefit
significantly from AE-GAN, achieving a BFDS of 0.692, indicating strong potential for these
models in fraud detection tasks. Conversely, simpler models like NB and RNN exhibit
poor performance across all oversampling techniques, with notably low BFDS scores, un-
derscoring their limited utility in this context. Overall, the combination of AE-GAN with
advanced ensemble methods like RF and XGB emerges as the most effective strategy for
fraudulent transaction detection. This combination not only maximizes the BFDS but also
ensures a balanced approach to handling class imbalance, making it a superior choice for
optimizing model performance in this challenging domain.
Table 14. BFDS comparison for each model across different oversampling techniques.
The Wilcoxon test p-values for BFDS comparison across the oversampling techniques are
presented in Table 15. VAE differs significantly from GAN (p-value = 0.0024), AE-GAN
(p-value = 0.0002), and AE (p-value = 0.0075). Similarly, ADASYN differs significantly
from GAN (p-value = 0.0061), AE-GAN (p-value = 0.0104), and AE (p-value = 0.0134).
Conversely, AE-GAN and AE do not exhibit a significant difference from each other
(p-value = 0.0572), indicating their comparable performance. SMOTE differs significantly
from VAE (p-value = 0.0409, below the 0.05 threshold) but not from ADASYN
(p-value = 0.0803). Because the test is two-sided, these results show that the BFDS
values obtained with VAE and ADASYN are statistically distinguishable from those obtained
with GAN, AE-GAN, and AE; the direction of each difference has to be read from the scores
in Table 14. In contrast, AE-GAN, AE, and SMOTE exhibit fewer statistically significant
differences among the comparisons reported.
This line illustrates AE-GAN’s superior performance in improving the balance between
sensitivity and precision compared to other techniques. In contrast, lines for other over-
sampling methods, such as SMOTE and ADASYN, generally display lower BFDS scores,
with the lines often positioned beneath the orange AE-GAN line. This indicates that while
these techniques are effective, they do not achieve the same level of enhancement in model
performance. Techniques like GAN, AE, and VAE show even lower BFDS scores, as their
lines remain further below the AE-GAN line, reflecting their reduced effectiveness. The
lines for NB and RNN also trail at the lower end, highlighting their difficulties in achieving
balanced performance. Overall, the orange AE-GAN line underscores its role as the most
effective oversampling technique for maximizing BFDS scores, surpassing other methods
in enhancing model performance.
Table 16 shows the obtained performance metrics for each model, highlighting the
best sensitivity, precision, F-measure, and G-mean values along with the corresponding
oversampling techniques. The analysis reveals that different oversampling techniques have
a notable impact on model performance. For instance, the AB model achieves the highest
sensitivity (0.933824) and G-mean (0.953585) using the SMOTE technique, indicating a
strong capability in detecting positive instances and maintaining a balanced performance.
In contrast, the ANN model exhibits superior precision (0.940476) with the AE technique,
demonstrating its effectiveness in reducing false positives, and performs well in F-measure
(0.832117) with VAE. The BC model, utilizing AE-GAN, excels in precision (0.913793), show-
casing its proficiency in correctly classifying positive instances. The DT model achieves
the highest G-mean (0.907209) with GAN, reflecting balanced performance but with lower
sensitivity and precision. Techniques such as SMOTE and ADASYN generally improve
sensitivity and G-mean across several models, highlighting their efficacy in managing class
imbalance. Conversely, AE and VAE techniques improve precision, as demonstrated by
ANN and XGB. Notably, the NB model shows a significant trade-off: its best sensitivity is
high (0.911765, obtained with ADASYN), but its best precision remains low (0.055274,
obtained with SMOTE). These results emphasize the importance of selecting appropriate
oversampling techniques to balance the trade-offs between
sensitivity, precision, and overall model performance.
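A summary such as Table 16 can be derived by taking, for each model and metric, the maximum over the oversampling techniques. The sketch below does this for NB using only the values quoted earlier in the text; techniques whose NB values are not quoted are left out, so the dictionaries are partial, and the helper name `best_technique` is an assumption.

```python
# NB values quoted in the discussion of the per-technique tables;
# techniques without a quoted NB value are omitted.
nb_sensitivity = {"GAN": 0.000000, "AE": 0.000000, "SMOTE": 0.882353, "ADASYN": 0.911765}
nb_precision = {"AE": 0.000000, "SMOTE": 0.055274, "ADASYN": 0.035048}


def best_technique(scores):
    """Return the (technique, value) pair with the highest metric value."""
    technique = max(scores, key=scores.get)
    return technique, scores[technique]


print(best_technique(nb_sensitivity))  # ('ADASYN', 0.911765)
print(best_technique(nb_precision))    # ('SMOTE', 0.055274)
```

The example shows that the best value of each metric for a given model may come from a different oversampling technique, which is why the table lists the technique next to each score.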
Figure 16 presents the best metric scores in terms of sensitivity, precision, F-measure,
and G-mean across various models, along with the associated oversampling techniques.
SMOTE is the most frequently appearing method, demonstrating its broad effectiveness,
particularly excelling in G-mean and sensitivity for models such as RF, KNN, and AB. GAN
also shows significant utility, notably enhancing precision and F-measure in models like
DT and RF, highlighting its strength in balancing sensitivity and precision. ADASYN is
employed in several instances, achieving impressive results in sensitivity and G-mean for
models including LR, GB, and LSTM. AE-GAN appears less frequently but is notable for
improving F-measure in models like GB and LSTM. AE and VAE are the least appearing
techniques, with AE showing strong performance in F-measure for XGB and VAE excelling
in precision for DT. This figure underscores the effectiveness of various oversampling
techniques, with SMOTE and ADASYN standing out for their broad applicability and GAN
and AE-GAN providing targeted improvements in specific metrics.
Figure 16. Best metric scores for each model with corresponding methods.
(0.453), closely followed by AE (0.452) and GAN (0.451). This indicates that generative
approaches are more effective than conventional methods in addressing class imbalance
for linear models. In DT and GB, ADASYN achieves the best performance (0.586 and 0.674,
respectively), suggesting that simpler oversampling techniques can still be effective for
these models. However, for more advanced models like RF and XGB, AE-GAN leads with
BFDS values of 0.697 and 0.691, respectively, highlighting its ability to generate high-quality
synthetic data that enhance fraud detection. For AB and BC, GAN achieves the highest
BFDS (0.620 and 0.678, respectively), while AE-GAN remains competitive, reinforcing the
strength of generative models in ensemble learning. Among the neural models (ANN, LSTM,
and RNN), GAN achieves the best result for ANN (0.694), while AE-GAN consistently ranks in the top
three for all neural models. This suggests that the hybrid AE-GAN model effectively
improves class balance while maintaining strong performance across diverse architectures.
For NB, SMOTE achieves the best BFDS (0.091), while generative models (AE-GAN and
GAN) perform similarly (0.070). This indicates that simple generative methods may not
be optimal for probabilistic models. Overall, AE-GAN ranks first or closely behind the
best-performing method across 11 out of 13 models, demonstrating its ability to handle
class imbalance effectively. Future work will focus on optimizing AE-GAN using Bayesian
optimization and distributed metaheuristic algorithms to further enhance performance
and scalability.
Table 17. Top 3 best oversampling techniques for each model based on BFDS.
6. Conclusions
Detecting fraudulent transactions is a critical challenge in the financial sector due to
the increasing sophistication of fraudulent activities and the substantial financial impact on
organizations. Effective fraud detection is essential for maintaining the integrity of financial
systems and protecting consumer assets. However, a significant hurdle in detecting fraud
is the imbalanced nature of fraud detection datasets, where fraudulent transactions are
rare compared to legitimate ones. This imbalance often leads to models that are biased
toward the majority class, making them ineffective at identifying fraudulent transactions.
To address this issue, we propose several generative models that produce synthetic data
from historical records to tackle the imbalanced learning problem: an Autoencoder, a
Variational Autoencoder, a Generative Adversarial Network (GAN), and a hybrid model that
combines an Autoencoder and a GAN. We conducted extensive
experiments comparing these generative models with traditional oversampling techniques
such as SMOTE and ADASYN. The results demonstrate that our proposed models yield
promising outcomes based on newly introduced evaluation metrics that integrate multiple
key performance indicators. However, several challenges affect the training process of
these generative models, particularly the sensitivity to hyperparameters, which require
careful tuning to optimize performance. Future work will focus on improving the training
process by implementing hyperparameter optimization using distributed methods com-
bined with metaheuristic algorithms to enhance the efficiency and effectiveness of these
generative models.
Author Contributions: Conceptualization, M.T. and S.E.K.; Data curation, S.E.K.; Formal analysis,
M.T.; Funding acquisition, S.E.K.; Investigation, M.T. and S.E.K.; Methodology, M.T.; Project adminis-
tration, S.E.K.; Resources, M.T.; Software, M.T.; Supervision, S.E.K.; Validation, S.E.K.; Visualization,
M.T.; Writing—original draft, M.T.; Writing—review and editing, S.E.K. All authors have read and
agreed to the published version of the manuscript.
Data Availability Statement: This paper uses a European dataset to test the efficiency of the
algorithms. This dataset is publicly available online and is free of charge:
https://www.kaggle.com/mlg-ulb/creditcardfraud (accessed on 26 December 2024).
Conflicts of Interest: The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported in this paper.
References
1. Chatterjee, P.; Das, D.; Rawat, D.B. Digital twin for credit card fraud detection: Opportunities, challenges, and fraud detection
advancements. Future Gener. Comput. Syst. 2024, 158, 410–426. [CrossRef]
2. Zioviris, G.; Kolomvatsos, K.; Stamoulis, G. An intelligent sequential fraud detection model based on deep learning. J.
Supercomput. 2024, 80, 14824–14847. [CrossRef]
3. Seera, M.; Lim, C.P.; Kumar, A.; Dhamotharan, L.; Tan, K.H. An intelligent payment card fraud detection system. Ann. Oper. Res.
2024, 334, 445–467. [CrossRef]
4. Gandhar, A.; Gupta, K.; Pandey, A.K.; Raj, D. Fraud Detection Using Machine Learning and Deep Learning. SN Comput. Sci.
2024, 5, 453. [CrossRef]
5. Bao, Q.; Wei, K.; Xu, J.; Jiang, W. Application of Deep Learning in Financial Credit Card Fraud Detection. J. Econ. Theory Bus.
Manag. 2024, 1, 51–57.
6. El Kafhali, S.; Tayebi, M. XGBoost based solutions for detecting fraudulent credit card transactions. In Proceedings of the
2022 International Conference on Advanced Creative Networks and Intelligent Systems (ICACNIS), Bandung, Indonesia, 23
November 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
7. Mienye, I.D.; Jere, N. Deep Learning for Credit Card Fraud Detection: A Review of Algorithms, Challenges, and Solutions. IEEE
Access 2024, 12, 96893–96910. [CrossRef]
8. Cherif, A.; Badhib, A.; Ammar, H.; Alshehri, S.; Kalkatawi, M.; Imine, A. Credit card fraud detection in the era of disruptive
technologies: A systematic review. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 145–174. [CrossRef]
9. Tayebi, M.; El Kafhali, S. A weighted average ensemble learning based on the cuckoo search algorithm for fraud transactions
detection. In Proceedings of the 2023 14th International Conference on Intelligent Systems: Theories and Applications (SITA),
Casablanca, Morocco, 22–23 November 2023; IEEE: New York, NY, USA, 2023; pp. 1–6.
10. Salekshahrezaee, Z.; Leevy, J.L.; Khoshgoftaar, T.M. The effect of feature extraction and data sampling on credit card fraud
detection. J. Big Data 2023, 10, 6. [CrossRef]
11. Strelcenia, E.; Prakoonwit, S. A survey on GAN techniques for data augmentation to address the imbalanced data issues in credit
card fraud detection. Mach. Learn. Knowl. Extr. 2023, 5, 304–329. [CrossRef]
12. Alraddadi, A.S. A survey and a credit card fraud detection and prevention model using the decision tree algorithm. Eng. Technol.
Appl. Sci. Res. 2023, 13, 11505–11510. [CrossRef]
13. Kalid, S.N.; Khor, K.C.; Ng, K.H.; Tong, G.K. Detecting frauds and payment defaults on credit card data inherited with imbalanced
class distribution and overlapping class problems: A systematic review. IEEE Access 2024, 12, 23636–23652. [CrossRef]
14. Goswami, S.; Singh, A.K. A literature survey on various aspect of class imbalance problem in data mining. Multimed. Tools Appl.
2024, 83, 70025–70050. [CrossRef]
15. Yadav, R.; Yadav, M.; Ranvijay; Sawle, Y.; Viriyasitavat, W.; Shankar, A. AI Techniques in Detection of NTLs: A Comprehensive
Review. Arch. Comput. Methods Eng. 2024, 31, 4879–4892. [CrossRef]
16. Btoush, E.A.L.M.; Zhou, X.; Gururajan, R.; Chan, K.C.; Genrich, R.; Sankaran, P. A systematic review of literature on credit card
cyber fraud detection using machine and deep learning. PeerJ Comput. Sci. 2023, 9, e1278. [CrossRef]
17. El Kafhali, S.; Tayebi, M. Generative adversarial neural networks based oversampling technique for imbalanced credit card
dataset. In Proceedings of the 2022 6th SLAAI International Conference on Artificial Intelligence (SLAAI-ICAI), Colombo,
Sri Lanka, 1–2 December 2022; IEEE: New York, NY, USA, 2022; pp. 1–5.
18. Sabuhi, M.; Zhou, M.; Bezemer, C.P.; Musilek, P. Applications of generative adversarial networks in anomaly detection: A
systematic literature review. IEEE Access 2021, 9, 161003–161029. [CrossRef]
19. Tayebi, M.; El Kafhali, S. Credit Card Fraud Detection Based on Hyperparameters Optimization Using the Differential Evolution.
Int. J. Inf. Secur. Priv. (IJISP) 2022, 16, 1–21. [CrossRef]
20. Tayebi, M.; El Kafhali, S. Performance analysis of metaheuristics based hyperparameters optimization for fraud transactions
detection. Evol. Intell. 2024, 17, 921–939. [CrossRef]
21. El Kafhali, S.; Tayebi, M.; Sulimani, H. An Optimized Deep Learning Approach for Detecting Fraudulent Transactions. Information
2024, 15, 227. [CrossRef]
22. Zhu, H.; Zhou, M.; Liu, G.; Xie, Y.; Liu, S.; Guo, C. NUS: Noisy-sample-removed undersampling scheme for imbalanced
classification and application to credit card fraud detection. IEEE Trans. Comput. Soc. Syst. 2023. [CrossRef]
23. Strelcenia, E.; Prakoonwit, S. Improving classification performance in credit card fraud detection by using new data augmentation.
AI 2023, 4, 172–198. [CrossRef]
24. Gupta, P.; Varshney, A.; Khan, M.R.; Ahmed, R.; Shuaib, M.; Alam, S. Unbalanced credit card fraud detection data: A machine
learning-oriented comparative study of balancing techniques. Procedia Comput. Sci. 2023, 218, 2575–2584. [CrossRef]
25. Mienye, I.D.; Sun, Y. A deep learning ensemble with data resampling for credit card fraud detection. IEEE Access 2023,
11, 30628–30638. [CrossRef]
26. Fanai, H.; Abbasimehr, H. A novel combined approach based on deep Autoencoder and deep classifiers for credit card fraud
detection. Expert Syst. Appl. 2023, 217, 119562. [CrossRef]
27. Huang, H.; Liu, B.; Xue, X.; Cao, J.; Chen, X. Imbalanced credit card fraud detection data: A solution based on hybrid neural
network and clustering-based undersampling technique. Appl. Soft Comput. 2024, 154, 111368. [CrossRef]
28. Abdul Salam, M.; Fouad, K.M.; Elbably, D.L.; Elsayed, S.M. Federated learning model for credit card fraud detection with data
balancing techniques. Neural Comput. Appl. 2024, 36, 6231–6256. [CrossRef]
29. Kennedy, R.K.; Villanustre, F.; Khoshgoftaar, T.M.; Salekshahrezaee, Z. Synthesizing class labels for highly imbalanced credit card
fraud detection data. J. Big Data 2024, 11, 38. [CrossRef]
30. Charizanos, G.; Demirhan, H.; İçen, D. An online fuzzy fraud detection framework for credit card transactions. Expert Syst. Appl.
2024, 252, 124127. [CrossRef]
31. Cherif, A.; Ammar, H.; Kalkatawi, M.; Alshehri, S.; Imine, A. Encoder–decoder graph neural network for credit card fraud
detection. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102003. [CrossRef]
32. Sampath, V.; Maurtua, I.; Aguilar Martin, J.J.; Gutierrez, A. A survey on generative adversarial networks for imbalance problems
in computer vision tasks. J. Big Data 2021, 8, 1–59. [CrossRef]
33. Abukmeil, M.; Ferrari, S.; Genovese, A.; Piuri, V.; Scotti, F. A survey of unsupervised generative models for exploratory data
analysis and representation learning. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [CrossRef]
34. Cheng, Y.; Wang, C.H.; Potluru, V.K.; Balch, T.; Cheng, G. Downstream task-oriented generative model selections on synthetic
data training for fraud detection models. arXiv 2024, arXiv:2401.00974.
35. Tayebi, M.; El Kafhali, S. Combining Autoencoders and Deep Learning for Effective Fraud Detection in Credit Card Transactions.
Oper. Res. Forum 2025, 6, 1–30. [CrossRef]
36. Singh, R.; Srivastava, N.; Kumar, A. Network Anomaly Detection Using Autoencoder on Various Datasets: A Comprehensive
Review. Recent Patents Eng. 2024, 18, 63–77. [CrossRef]
37. Singla, J.; Kanika. A survey of deep learning based online transactions fraud detection systems. In Proceedings of the 2020
International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 17–19 June 2020; IEEE: New York,
NY, USA, 2020; pp. 130–136.
38. Khemakhem, I.; Kingma, D.; Monti, R.; Hyvarinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In
Proceedings of the International Conference on Artificial Intelligence and Statistics, Palermo, Italy, 26–28 August 2020; PMLR:
New York, NY, USA, 2020; pp. 2207–2217.
39. Akkem, Y.; Biswas, S.K.; Varanasi, A. A comprehensive review of synthetic data generation in smart farming by using variational
autoencoder and generative adversarial network. Eng. Appl. Artif. Intell. 2024, 131, 107881. [CrossRef]
40. Zhao, C.; Sun, X.; Wu, M.; Kang, L. Advancing financial fraud detection: Self-attention generative adversarial networks for
precise and effective identification. Financ. Res. Lett. 2024, 60, 104843. [CrossRef]
41. Zhao, P.; Ding, Z.; Li, Y.; Zhang, X.; Zhao, Y.; Wang, H.; Yang, Y. SGAD-GAN: Simultaneous Generation and Anomaly Detection
for time-series sensor data with Generative Adversarial Networks. Mech. Syst. Signal Process. 2024, 210, 111141. [CrossRef]
42. Mishra, A.K.; Paliwal, S.; Srivastava, G. Anomaly detection using deep convolutional generative adversarial networks in the
internet of things. ISA Trans. 2024, 145, 493–504. [CrossRef]
43. Kaggle. Credit Card Fraud Detection. 2018. Available online: https://www.kaggle.com/mlg-ulb/creditcardfraud (accessed on
26 December 2024).