
Synth

This document discusses the use of Generative Adversarial Networks (GANs) for generating synthetic data to address class imbalance in credit card fraud detection. It highlights the effectiveness of a novel K-CGAN approach in improving classifier performance compared to traditional methods and other oversampling techniques. The study emphasizes the importance of balanced datasets for enhancing the accuracy of fraud detection models.


Generating Synthetic Data for Credit Card Fraud Detection Using GANs


Emilija Strelcenia
Department of Creative Technology
Bournemouth University
Bournemouth, United Kingdom
[email protected]

Simant Prakoonwit
Department of Creative Technology
Bournemouth University
Bournemouth, United Kingdom
[email protected]

Abstract—Deep learning-based classifiers for object classification and recognition have been used in various sectors. However, according to the research literature, deep neural networks perform better on balanced datasets than on imbalanced ones. In production environments, datasets are often imbalanced because fraud cases are rare. Deep generative approaches such as GANs have been applied as an efficient method to augment high-dimensional data. In this study, classifiers based on Random Forest, Nearest Neighbor, Logistic Regression, MLP, and AdaBoost were trained using our novel K-CGAN approach and compared against other oversampling approaches, achieving higher F1 scores. Experiments demonstrate that the classifiers trained on the augmented set performed far better than the same classifiers trained on the original data, producing an effective fraud detection mechanism. Furthermore, this research demonstrates the problem of data imbalance and introduces a novel model that is able to generate high-quality synthetic data.

Keywords—fraud, GANs, synthetic data, class imbalance

I. INTRODUCTION

Class imbalance is one of the most difficult problems in credit card fraud detection. To address it, [1] introduced two frameworks: a data-oriented one and an algorithmic one. For the algorithmic framework, the authors applied RF, KNN, ID3, and Naïve Bayes to multiple samples drawn from three datasets, and selected the most effective classifiers based on misclassification cost using a probability threshold. The data-oriented framework deals with resampling procedures such as SMOTE, under-sampling, and over-sampling. According to the authors, the data-centric method with over-sampling attains the desired results, whereas under-sampling achieves inferior results. They also evaluated the algorithmic framework using the F1 score as the metric and identified Random Forest as an excellent classifier.

In another study, [2] introduced a framework for handling the class imbalance issue with the help of data mining methods. Their approach builds a new model whenever new data arrives in the system. The authors also conducted a comparative study of approaches such as MNET, SVM, and RF and of sampling techniques such as under-sampling and SMOTE. In addition, they described the importance of updating a model in non-stationary conditions to attain the desired results. [2] found RF to be the most effective approach compared with the other models.

An empirical study conducted by [3] aimed to address the class imbalance issue. The authors considered various approaches to this issue in the credit card fraud domain, discussing over-sampling, under-sampling, SMOTE, and cost-sensitive learning threshold techniques. They carried out a comparative study and identified the impact of the degree of class skew on the given classifiers. They also studied the Naïve Bayes approach under multiple degrees of skew and assessed their findings.

Similarly, [4] introduced an innovative method that uses weighted extreme learning machines to address skewed data problems. This approach is based on an improved neural network with a single hidden layer, in which weights are assigned to all samples. The findings suggest that their approach handles the data imbalance issue differently, and the results confirm a performance improvement compared with other approaches.

Furthermore, [5] emphasize that class imbalance is a regular challenge in classification tasks with machine learning algorithms, and argue that the problem is not limited to fraud detection in the credit card domain. Their study focused on the class imbalance problem in bankruptcy prediction. The authors introduced two models to handle the issue and developed a hybrid framework of cost-sensitive learning and over-sampling methods. They first applied over-sampling on the validation set to find an optimal balancing ratio. For bankruptcy prediction, they then used a cost-sensitive learning model, C-Boost. The data they employed was highly imbalanced, with a ratio of 0.0026. The authors also stressed the likelihood of model over-fitting when using over-sampling methods, since these create copies of minority-class samples to balance the data.
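The over-fitting concern raised by [5] follows from how plain random over-sampling works: it balances the classes by repeating minority rows, so no new information is added. The sketch below is only an illustration of that mechanism, not code from any of the cited studies; the function name `random_oversample` and the toy data are hypothetical.

```python
import random

def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate minority-class rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    extra_needed = majority_count - len(minority_rows)
    if not minority_rows or extra_needed <= 0:
        return list(rows), list(labels)
    # Sampling with replacement: every added row is an exact copy.
    copies = [rng.choice(minority_rows) for _ in range(extra_needed)]
    return list(rows) + copies, list(labels) + [minority] * len(copies)

# Toy data (hypothetical): five legitimate rows and one fraud row.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [9.9]]
labels = [0, 0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)

print(balanced_labels.count(0), balanced_labels.count(1))  # 5 5
# Number of distinct fraud rows after balancing: still 1.
print(len({tuple(r) for r, y in zip(balanced_rows, balanced_labels) if y == 1}))
```

The class counts are equal after the call, yet the minority class still contains a single distinct row. A classifier can memorize such repeated rows, which is the motivation for generating new synthetic samples instead.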
II. RELATED WORK

[6] explored multiple aspects of GANs. They argued that GANs are a more appropriate and effective framework for handling class imbalance than other sampling models. The authors believe that GANs are highly robust to class overlap and over-fitting, because a GAN learns the hidden patterns of the data by using deep networks. They also reviewed GANs along several facets, such as architectural design, difficulties associated with GANs, variants that address specific traits, and application areas. In addition, they pointed to empirical studies that evaluate GANs with the help of metrics, and performed a comparative study of GANs against resampling methods such as SMOTE. Their study finds that GANs are more effective than other resampling methods, and that variants such as WGAN and WGAN-GP are most suitable for mitigating the above issues.

The study conducted by [7] reviewed several aspects of GANs. The authors considered GAN variants such as CGAN and fully connected GAN and explored the pros and cons of these variants.

The study by [8] differs from the above studies in that it examined GANs from theoretical and mathematical perspectives. It provides deep insight into the training complications linked with GAN variants, and presents three points of view from which to tackle problems in GAN training: skills, GAN structure, and the objective of the framework. The authors assert that inception score, multi-scale structural similarity, mode score, and Fréchet inception distance are the most effective metrics for evaluating the capability of a GAN.

[9] focused on the limitations and applicability of GANs in the banking domain. They used the WGAN-GP variant to augment data, and observed an increase of 5 percent in the recall of the XGBoost classifier after training on augmented data compared with training on real-world data. It is also worth noting that they detected a decrease in F1 score and precision.

To sum up the above discussion, the most common challenge in credit card fraud detection is the class imbalance problem. In recent years, scholars have presented various machine learning techniques to deal with it. GANs are one of the most popular and effective techniques for handling class imbalance, and many scholars have proposed GAN variants for this purpose. These developments are making GANs the most effective method. However, more research is needed to improve the predictability, efficacy, accuracy, and applicability of GAN variants.

III. GANS

GANs, a family of machine learning approaches for data generation, were proposed by [10]. These algorithms attracted much attention due to their efficiency and simplicity. Within a brief period, researchers introduced novel variants of the conventional GAN approach. Regarding the applicability of GANs, considerable developments were made in areas such as image creation, computer vision, social media fraud, gambling fraud, credit card fraud detection, and others.

The traditional GAN approach estimates models via an adversarial mechanism in which two neural networks are trained: the Generator G and the Discriminator D. The function of G is to map a random noise vector to synthetic data that closely reflects the actual data. D, on the other hand, takes actual samples and acts as a teacher that evaluates the output and checks whether data is fake or real. G and D are trained through a min-max game in which G minimizes the objective while D maximizes it [7].

The Discriminator is a classifier that receives real data and artificial data from the Generator and tries to tell them apart. First, D classifies the data as real or artificial; second, D is penalized for misclassification.

The Generator, in turn, uses the feedback from D to learn to generate artificial data that has the same traits as the original data.

A standard GAN is made of a generator neural network $G$ and a discriminator neural network $D$. They are trained in competition with each other in what is known as a two-player min-max game. The discriminator adjusts its weights in order to distinguish real data samples $x$ from fake data samples $G(z)$, produced by passing $z$, randomly sampled from some distribution $p_z$, through the generator network. The generator then adjusts its weights to trick the discriminator.

The discriminator assigns probability $D(x)$ to the case where $x$ is a "real" training data sample, and probability $1 - D(G(z))$ to the case where $G(z)$ is a "fake" sample produced by the generator. The two networks are trained iteratively using the loss function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

where $G$ tries to minimize $V(D, G)$ while $D$ tries to maximize it. In practice, the expectations are replaced by empirical means over a mini-batch of samples, and the loss is alternately minimized and maximized from one mini-batch to the next, as in mini-batch gradient descent.

Figure 1 shows the process of preprocessing data and creating balanced data sets. The proposed solution comprises two neural networks, the discriminator (D) and the generator (G). [10] introduced this type of architecture, which was inspired by game theory. Using these networks, the GAN generates new data samples that are similar to the training data according to the learned probability distribution. Because GANs are a very adaptable and general algorithm, meticulous fine-tuning has proved able to resolve their drawbacks, and can in the end produce an optimized architecture that can be applied for various ML purposes.
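To make the mini-batch form of the min-max objective in Eq. (1) concrete, the sketch below estimates it from discriminator outputs, writing D(x) for the discriminator's output on a real sample and D(G(z)) for its output on a generated one. It is a minimal illustration in plain Python, not the training code of this study; the function name and the example probabilities are hypothetical.

```python
import math

def value_function(d_real, d_fake):
    """Mini-batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs, for illustration only.
# Case 1: D separates real from fake well.
confident_d = value_function(d_real=[0.9, 0.95, 0.85], d_fake=[0.1, 0.05, 0.2])
# Case 2: G fools D completely, so D outputs 0.5 everywhere.
fooled_d = value_function(d_real=[0.5, 0.5, 0.5], d_fake=[0.5, 0.5, 0.5])

print(confident_d > fooled_d)  # True: D pushes V up, G pushes it down
print(round(fooled_d, 4))      # -1.3863, i.e. 2 * log(0.5)
```

The second value, 2 log 0.5, is what the objective evaluates to when the discriminator cannot tell the two sources apart, which is the state the generator is trying to reach.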
Fig. 1. Process of data preprocessing and balanced dataset generation (Goodfellow, in [10]).

A. Experimental Design

There are a few different loss functions that can be used in GANs, and the choice depends on the type of data being generated. For example, if the data is images, the loss function might be based on the mean squared error between the generated image and the real image. Other types of data might use other loss functions.

The most important thing to remember about GANs is that the two networks pull the same objective in opposite directions. The generator is trying to generate data that is realistic enough to fool the discriminator, while the discriminator is trying to learn to distinguish between real and fake data. This means that the generator is trying to minimize the loss function, while the discriminator is trying to maximize it. One common way to think about this is that the generator is searching the loss landscape for a "sweet spot" where the fake data is realistic enough that the discriminator can no longer tell it apart from the real data.

The other important thing to remember about GANs is that they are inherently unstable. The generator and discriminator are both learning at the same time and from the same data, which can cause them to "fight" with each other and lead to training instabilities. There are a few ways to deal with this, such as using different types of GANs or different loss functions.

There are various types of GANs, each with its own advantages and disadvantages. The most common is the vanilla GAN, which is also the simplest. Vanilla GANs are good for generating simple data, such as images of handwritten digits. They are relatively easy to train and do not require a lot of computational power. However, they are not very good at generating complex data, such as natural images.

Another type is the conditional GAN, which is similar to a vanilla GAN but takes one additional input, the condition. The condition can be anything, but it is usually something that helps the generator produce more realistic data. For example, if the data is images of faces, the condition might be the age of the person in the image, which would help the generator produce more realistic images of people of different ages. Conditional GANs are more difficult to train than vanilla GANs, but they can generate more realistic data. In our experiments we used the CGAN architecture with fine-tuned hyperparameters and the novel loss function.

B. Discriminator Loss

The objective of the discriminator network is to maximize the likelihood of a sample x if it belongs to the real data and to minimize the likelihood of a sample x if it belongs to the fake data. The equation below shows the Discriminator loss:

$$\text{Loss} = -\frac{1}{\text{output size}} \sum_{i=1}^{\text{output size}} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{2}$$

where $y_i$ is the true label of sample $i$ and $\hat{y}_i$ is the predicted probability.

C. Generator Loss

The objective of the generator network is to fool the discriminator by generating fake samples which look like real samples. In our proposed K-CGAN model we have added a new loss term, the KL divergence, which measures the difference between two distributions. As a result, our Generator loss has two objectives:

• Fool the Discriminator. We use binary cross entropy for this loss.

• Make sure the synthetic data distribution is the same as the original data distribution. We use KL divergence for this loss.

The equation below shows the binary cross entropy and KL divergence losses:

$$\text{Loss} = -\frac{1}{\text{output size}} \sum_{i=1}^{\text{output size}} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \sum_{i} p_i \log \frac{p_i}{q_i} \tag{3}$$

Kullback-Leibler (KL) divergence is a measure of how close two distributions are. Many hyperparameters had to be adjusted to achieve the best possible performance with our proposed method. The hyperparameters below were identified as the best option after extensive experimenting. The settings we used are shown in Table I and Table II. The learning rate was set to 0.001, the hidden layer activation to ReLU, and the random noise vector size to 100. The dropout ratio was set to 0.1 for both the discriminator and generator hidden layers. The batch size was 64 and the number of epochs was 100. We used the ReLU activation function for the generator and LeakyReLU for the discriminator, with the Adam optimizer. We found that by using weight initialization (glorot_uniform) and weight regularization (L2 method), we were able to reduce the size of our neural network during training.

We also tried different dropout values and found that a value of 0.1 worked best for our case. The L2 kernel regularizer worked best for this dataset. This is probably because there are many features in this dataset, and some of them are likely to be highly correlated. L2 regularization helps to prevent overfitting by penalizing large weights, and thus encourages the model to find a simpler solution. Figures 2 and 3 show the architecture of the K-CGAN Discriminator and Generator neural networks.
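The discriminator and generator losses in Eqs. (2) and (3) can be sketched in plain Python as follows. This is an illustrative reading under stated assumptions, not the implementation used in this study: it assumes that p and q are discrete (histogram-style) distributions of the original and synthetic data, that the two terms are summed without weighting, and that the generator's cross-entropy targets are all "real". The function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross entropy over a batch, as in Eq. (2)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, EPS), 1.0 - EPS)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)

def generator_loss(d_on_fake, p_real, q_synth):
    """Eq. (3) under the assumptions above: BCE against 'real' targets plus KL."""
    targets = [1] * len(d_on_fake)  # the generator wants D to answer 'real'
    return binary_cross_entropy(targets, d_on_fake) + kl_divergence(p_real, q_synth)

# Hypothetical histograms of one feature: original data vs. two synthetic batches.
p_real = [0.7, 0.2, 0.1]
matched = generator_loss([0.9, 0.8], p_real, [0.7, 0.2, 0.1])
mismatched = generator_loss([0.9, 0.8], p_real, [0.1, 0.2, 0.7])

print(matched < mismatched)  # True: the KL term penalizes the shifted distribution
```

With identical discriminator outputs, the batch whose distribution matches the original data gets the lower loss, which is the behaviour the added KL term is meant to encourage.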
PCA is a good method for dimensionality reduction, but it
can sometimes introduce information loss. In this case, we are
not too worried about information loss because we are only
interested in the class prediction (fraud or not fraud), and not in
the details of the individual features.
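The information loss mentioned above can be quantified as the share of variance that the discarded principal components carried. The toy sketch below does this for two-dimensional data using the closed-form eigenvalues of a 2x2 covariance matrix. It is only an illustration with hypothetical data points; the dataset used in this study was already PCA-transformed by its publishers, and this code was not part of the study.

```python
import math

def explained_variance_2d(points):
    """Share of total variance kept by the first principal component of 2-D data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    largest, smallest = half_trace + spread, half_trace - spread
    return largest / (largest + smallest)

collinear = [(1, 2), (2, 4), (3, 6), (4, 8)]          # lies exactly on a line
noisy = [(1, 2.3), (2, 3.6), (3, 6.4), (4, 7.7)]      # slightly off the line

print(explained_variance_2d(collinear))    # approximately 1: nothing is lost
print(explained_variance_2d(noisy) < 1.0)  # True: dropping a component loses variance
```

When the ratio is close to 1, keeping only the first component loses very little; the further it falls below 1, the more information the reduction discards.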

TABLE I. GENERATOR NEURAL NETWORK HYPERPARAMETER SETTINGS

Parameter                   Value
Learning Rate               .0001
Hidden Layer Optimizer      Relu
Output Optimizer            Adam
Loss Function               Trained Discriminator Loss + KL Divergence
Hidden Layers               2 (128, 64)
Dropout                     .1
Random Noise Vector         100
Kernel Initializer          glorot_uniform
Kernel Regularizer          L2 method
Total Learning Parameters   36,837

TABLE II. DISCRIMINATOR NEURAL NETWORK HYPERPARAMETER SETTINGS

Parameter                   Value
Learning Rate               .0001
Hidden Layer Optimizer      LeakyRelu
Output Optimizer            Adam
Loss Function               Binary Cross Entropy
Hidden Layers               2 (20, 10)
Dropout                     .1
Kernel Regularizer          L2 method
Total Learning Parameters   1,519

Fig. 2. K-CGAN discriminator architecture.

Fig. 3. K-CGAN generator architecture with novelty loss.

We used the publicly available imbalanced Credit Card Fraud dataset from Kaggle.

TABLE III. REAL-WORLD CREDIT CARD DATASET

ID   Data Set            #Features   #Instances   IR
1    Credit Card Fraud   30          2,492        1:4.07

This is a public dataset that can be accessed and downloaded from Kaggle. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

Fig. 4. Original imbalanced dataset (Kaggle).

Fig. 5. Balanced dataset showing equal number of minority and majority class samples.
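The imbalance ratio (IR) reported in Table III can be reproduced with a few lines of arithmetic. The sketch below assumes that the 2,492 instances in Table III include the 492 fraud cases, leaving 2,000 legitimate transactions, and that balancing means bringing the minority class up to the majority count, as the caption of Fig. 5 suggests. Both points are inferences from the reported numbers rather than statements from the study, and the variable names are hypothetical.

```python
def imbalance_ratio(n_minority, n_majority):
    """Return x such that the class ratio reads 1:x (minority:majority)."""
    return n_majority / n_minority

n_total = 2492                 # instances, from Table III
n_fraud = 492                  # minority class size
n_legit = n_total - n_fraud    # assumption: the remainder are legitimate

ratio = imbalance_ratio(n_fraud, n_legit)
synthetic_needed = n_legit - n_fraud  # fraud samples to generate for a 1:1 balance

print(f"IR = 1:{ratio:.2f}")   # IR = 1:4.07
print(synthetic_needed)        # 1508
```

Under these assumptions the computed ratio matches the 1:4.07 in Table III, and 1,508 synthetic fraud samples would be needed to equalize the two classes.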
It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, …, V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise. Figure 4 shows the state of the original imbalanced dataset, and Figure 5 shows the state of the dataset after introducing an equal number of samples from the minority class distribution.

To characterize our approach as successful, the following criteria must be satisfied:

H1. Utilizing K-CGAN to improve imbalanced datasets will result in better performance of algorithms on those datasets.

H2. These were evaluated by combining the original and artificial sets with the five classification algorithms: XGBoost, Random Forest, Nearest Neighbor, MLP, and Logistic Regression.

With the original dataset, we trained our K-CGAN model to produce a synthetic dataset. We then tested it with various classification algorithms and saw an improvement in the F1 score when synthetic fraud transactions generated by K-CGAN were introduced. The experimental process we followed is detailed in the flowchart in Figure 6.

Fig. 6. Flowchart of our experimental process.

IV. RESULTS

For credit card fraud, we show the classification results obtained after 100 epochs for each oversampling technique and classification algorithm.

We divided the data into testing and training sets. The training set included 80% of each class's samples, while the testing set contained the remaining 20%.

As can be seen from Table IV, our method improves the performance of all the classification algorithms. This demonstrates the effectiveness of our method for imbalanced data classification.

TABLE IV. CREDIT CARD DETECTION RESULTS: F1 SCORE MEASURE
We report the F1-score. The best result in each row is marked with *.

Algorithm             Original   SMOTE      ADASYN     B-SMOTE    cGAN      K-CGAN
XGBoost               0.824561   0.999613   0.999726   0.999760   0.88210   0.999817*
Random Forest         0.824561   0.999745   0.999673   0.999431   0.88870   0.999810*
Nearest Neighbor      0.725664   0.990713   0.987067   0.996691   0.85390   0.999733*
MLP                   0.768559   0.998146   0.998131   0.998634   0.88070   0.999747*
Logistic Regression   0.673077   0.945500   0.884381   0.985338   0.75000   0.999613*

V. CONCLUSION

We created a new method, K-CGAN, for generating synthetic data with CGANs that uses KL divergence in the Generator loss function. We compared our approach against well-known oversampling techniques (SMOTE, B-SMOTE and ADASYN) as well as other adversarial network architectures used to generate new data (cGANs).

We conducted a study to assess how well K-CGAN can generate high-quality synthetic data. We compared the performance of five machine learning classification algorithms that were combined with our method, using a publicly available credit card fraud dataset. The results in Table IV
show that K-CGAN outperformed all other oversampling methods, achieving the highest overall rank. Compared with SMOTE, ADASYN, B-SMOTE and cGAN, our method had the best performance. In future we plan to use K-CGAN for detecting other types of anomaly, not just in credit card datasets but also in time series and computer network traffic datasets.

REFERENCES

[1] Brennan, P., 2012. A comprehensive survey of methods for overcoming the class imbalance problem in fraud detection. Institute of Technology Blanchardstown, Dublin, Ireland.
[2] Dal Pozzolo, A., Caelen, O., Le Borgne, Y.A., Waterschoot, S. and Bontempi, G., 2014. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), pp.4915-4928.
[3] Thabtah, F., Hammoud, S., Kamalov, F. and Gonsalves, A., 2020. Data imbalance in classification: Experimental evaluation. Information Sciences, 513, pp.429-441.
[4] Zhu, H., Liu, G., Zhou, M., Xie, Y., Abusorrah, A. and Kang, Q., 2020. Optimizing weighted extreme learning machines for imbalanced classification and application to credit card fraud detection. Neurocomputing, 407, pp.50-62.
[5] Le, T., Vo, M.T., Vo, B., Lee, M.Y. and Baik, S.W., 2019. A hybrid approach using oversampling technique and cost-sensitive learning for bankruptcy prediction. Complexity, 2019.
[6] Ngwenduna, K.S. and Mbuvha, R., 2021. Alleviating class imbalance in actuarial applications using generative adversarial networks. Risks, 9(3), p.49.
[7] Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B. and Bharath, A.A., 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1), pp.53-65.
[8] Gui, J., Sun, Z., Wen, Y., Tao, D. and Ye, J., 2021. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Transactions on Knowledge and Data Engineering.
[9] Pandey, A., Bhatt, D. and Bhowmik, T., 2020. Limitations and Applicability of GANs in Banking Domain. In ADGN@ECAI.
[10] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
[11] Assefa, S.A., Dervovic, D., Mahfouz, M., Tillman, R.E., Reddy, P. and Veloso, M., 2020, October. Generating synthetic data in finance: opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance (pp. 1-8).
[12] Charitou, C., Dragicevic, S. and Garcez, A.D.A., 2021. Synthetic Data Generation for Fraud Detection using GANs. arXiv preprint arXiv:2109.12546.
[13] Eckerli, F. and Osterrieder, J., 2021. Generative Adversarial Networks in finance: an overview. arXiv preprint arXiv:2106.06364.
[14] Ferreira, F., Lourenço, N., Cabral, B. and Fernandes, J.P., 2021. When Two are Better Than One: Synthesizing Heavily Unbalanced Data. IEEE Access, 9, pp.150459-150469.
[15] Koshiyama, A., Firoozye, N. and Treleaven, P., 2019. Generative adversarial networks for financial trading strategies fine-tuning and combination. arXiv preprint arXiv:1901.01751.
[16] Mirza, M. and Osindero, S., 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[17] Rundo, F., Trenta, F., di Stallo, A.L. and Battiato, S., 2019. Machine learning for quantitative finance applications: A survey. Applied Sciences, 9(24), p.5574.
[18] Takahashi, S., Chen, Y. and Tanaka-Ishii, K., 2019. Modeling financial time-series with generative adversarial networks. Physica A: Statistical Mechanics and its Applications, 527, p.121261.
[19] Wiese, M., Knobloch, R., Korn, R. and Kretschmer, P., 2020. Quant GANs: deep generation of financial time series. Quantitative Finance, 20(9), pp.1419-1440.
[20] Zhang, Z., Yang, L., Chen, L., Liu, Q., Meng, Y., Wang, P. and Li, M., 2020. A generative adversarial network-based method for generating negative financial samples. International Journal of Distributed Sensor Networks, 16(2), p.1550147720907053.
[21] Benchaji, I., Douzi, S. and El Ouahidi, B., 2021. Credit Card Fraud Detection Model Based on LSTM Recurrent Neural Networks. Journal of Advances in Information Technology, 12(2), pp.113-118. doi: 10.12720/jait.12.2.113-118.
[22] Lepoivre, M.R., Avanzini, C.O., Bignon, G., Legendre, L. and Piwele, A.K., 2016. Credit Card Fraud Detection with Unsupervised Algorithms. 7(1), pp.34-38. doi: 10.12720/jait.7.1.34-38.
