Credit Card Fraud Detection Using A Deep Learning Multistage Model
https://doi.org/10.1007/s11227-022-04465-9
Abstract
The banking sector is on the eve of a serious transformation and the thrust behind it
is artificial intelligence (AI). Novel AI applications have already been proposed to
deal with challenges in the areas of credit scoring, risk assessment, client experience
and portfolio management. One of the most critical challenges in the aforementioned
sector is fraud detection upon streams of transactions. Recently, deep learning mod‑
els have been introduced to deal with the specific problem in terms of detecting and
forecasting possible fraudulent events. The aim is to estimate the unknown distribu‑
tion of normal/fraudulent transactions and then to identify deviations that may indi‑
cate a potential fraud. In this paper, we elaborate on a novel multistage deep learning model that aims to efficiently manage the incoming streams of transactions
and detect the fraudulent ones. We propose the use of two autoencoders to perform
feature selection and learn the latent data space representation based on a nonlinear
optimization model. On the delivered significant features, we subsequently apply a
deep convolutional neural network to detect frauds, thus combining two different
processing blocks. The adopted combination has the goal of detecting frauds over
the exposed latent data representation and not over the initial data.
* Georgios Zioviris
[email protected]
Kostas Kolomvatsos
[email protected]
George Stamoulis
[email protected]
1 Department of Electrical and Computer Engineering, University of Thessaly, Glavani 37, 38221 Volos, Greece
2 Department of Informatics and Telecommunications, University of Thessaly, Papasiopoulou 2-4, 35131 Lamia, Greece
1 Introduction
Recent studies reveal that the fraud detection market is estimated to be worth $19.5 billion^1. This amount is expected to grow, with estimates reaching $63 billion by 2023. According to the Consumer Sentinel Network of the USA, of the 3.2 million identity theft and fraud reports received in 2019, 1.7 million were fraud-related, and in 23% of those fraud cases a monetary loss was reported^2. Fraud management is a key research subject for multiple areas like insurance claims, money laundering, electronic payments, bank transactions, etc. Fraud should be detected in
the minimum possible time upon streams that convey transactional data. Obviously,
rich datasets can be formulated in the financial services sector with high complexity and volume. The challenge for banking institutions then is to quickly detect and separate fraudulent transactions from legitimate ones with no impact on customer experi‑
ence. Traditionally, the methods adopted for fraud detection are related to manual
intervention or rule-based models with limited success^3. Manual intervention/detection suffers from increased detection time, while rule-based approaches deal with com‑
plex sets of conditions that should be met before a transaction is characterized as
suspicious. Additionally, in both scenarios (manual detection and rule-based sys‑
tems), an increased effort is required to deliver the discussed conditions upon which
a transaction may be characterized as fraudulent. Machine learning (ML) can be the
solution to these problems and especially deep ML (DML) that is capable of identi‑
fying more complex patterns upon huge volumes of data.
The application of ML in fraud detection may allow banking institutions to dis‑
cern genuine transactions from fraudulent events in real time. Apart from the tem‑
poral aspect, ML can assist in the definition of models that exhibit higher accuracy
than other schemes^4. An interesting approach is to mix multiple types of ML models
(e.g. supervised and unsupervised methods) to increase the 'detection' capability of the so-called ensemble scheme with respect to hidden aspects of the distribution. Evidently, any efficient learning methodology that reveals the hidden characteristics of data will lead to the recognition of new patterns in fraud management. In any case, the need for the discussed models, both simple and ensemble schemes, is imperative due to the
financial impact of frauds. Apart from the detection accuracy, ML schemes can sup‑
port automated methods for fraud detection; thus, the desired tasks that are manu‑
ally executed will be realized in an automated manner requiring less intervention by
employees. The speed-up of processing will increase the throughput of financial institutions, making them capable of dealing with the exponential growth of the fraud detection domain and the relevant data.
In this paper, we present a model that performs better than several machine learning methods in terms of recall, the metric that shows which of the transactions have been classified correctly as fraud over the total amount of frauds.
^1 https://www.statista.com/statistics/786778/worldwide-fraud-detection-and-prevention-market-size
^2 https://www.ftc.gov/reports/consumer-sentinel-network-data-book-2019
^3 https://interceptd.com/how-is-machine-learning-used-in-fraud-detection
^4 https://www.netguru.com/blog/fraud-detection-with-machine-learning-banking
A unique
ensemble model is proposed, which combines a deep autoencoder that is respon‑
sible for the feature reduction of the dataset, and a convolutional neural network
that classifies the transactions as frauds or not. Our feature selection approach is
adopted to reduce the computational burden of the upcoming classification process
performed by our CNN. The monitoring mechanism observes huge volumes of data
generated by electronic transactions. In other words, fraud detection is a special case of outlier detection. The highly imbalanced distribution of the classification classes (i.e.
fraud or normal transaction) generates new challenges that may be met by adopting a
feature selection methodology. The goal of the proposed approach is to detect the
most appropriate (or stable) features that are not dependent on the dataset. However,
to avoid selecting features in favour of the dominant class (as fraud events are far
less than the normal transactions), we rely on various oversampling techniques to
train our model. In this paper, we focus on an ‘adaptive’ model capable of detect‑
ing new, complex fraud patterns through the use of advanced DML schemes. The
adaptivity aspect of our approach is related to the underlying data and their effi‑
cient management together with the support of streaming environments. Our scheme
takes into consideration and reveals the latent data representation before it proceeds
to the final decision. This is a critical aspect of our work as we try to classify any
incoming transaction based on the features that are significant for the observed data.
We provide a methodology that avoids the disadvantages of legacy tools. We rely
on advanced DML to have the best possible performance. DML has been widely
adopted in many domains like image processing, speech recognition and natural lan‑
guage processing (NLP) [1]. DML advantages are revealed through the use of, e.g.
autoencoders [2], CNNs [3] and recurrent neural networks (RNNs) [4]. In recent
years, deep learning has attracted tremendous growth and attention as the outcome of more powerful hardware, larger datasets and techniques to train deeper networks^5.

^5 Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
We build upon the increased accuracy of ML/DML that can support institutions
in fraud detection through the elimination of false positive alerts, i.e. transactions
that are incorrectly classified as frauds. Our model reduces false negatives as well,
i.e. missing real fraud incidents. We also deal with the problem of unbalanced data
in the training dataset. We adopt the synthetic minority oversampling technique (SMOTE) and its variants, utilizing the k-nearest neighbours (kNN) algorithm to identify minority-class instances in the training dataset and learn their features. SMOTE can deliver a dataset that is balanced with respect to the desired classes, ensuring that our ML/DML models fed with these data will be more resistant to overfitting.
Nevertheless, to expose the performance and provide a comparative assessment of multiple oversampling techniques, we incorporate into our scheme the following models:
(i) SMOTE [5], (ii) Borderline SMOTE (minority examples near the borderline are
oversampled)[6], (iii) support vector machine (SVM) SMOTE (new minority class
instances are generated near borderlines with the use of SVM to assist in establishing the
boundary between classes) [7], (iv) k-Means SMOTE (minority class instances are
generated in safe and critical areas of the input space) [8], and (v) Adaptive Syn‑
thetic (ADASYN) that adopts a weighted distribution for different minority class
instances based on their level of difficulty in the learning process [9].
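As a rough illustration of how these five techniques can be compared in practice, the following sketch rebalances a training set with each of them in turn. It assumes the imbalanced-learn library (the paper does not name its implementation), and the parameters shown are library defaults, not the authors' settings.

```python
# Hypothetical comparison harness for the five oversampling techniques;
# imbalanced-learn is an assumed implementation choice.
from imblearn.over_sampling import (SMOTE, BorderlineSMOTE, SVMSMOTE,
                                    KMeansSMOTE, ADASYN)

def rebalance_all(X_train, y_train, seed=42):
    """Return {technique name: rebalanced (X, y)} for the training split only;
    the test split must stay untouched to avoid leaking synthetic samples."""
    samplers = {
        "SMOTE": SMOTE(random_state=seed),
        "Borderline-SMOTE": BorderlineSMOTE(random_state=seed),
        "SVM-SMOTE": SVMSMOTE(random_state=seed),
        # k-Means SMOTE may need cluster_balance_threshold tuned
        # on extremely skewed data such as fraud sets
        "k-Means SMOTE": KMeansSMOTE(random_state=seed),
        "ADASYN": ADASYN(random_state=seed),
    }
    return {name: s.fit_resample(X_train, y_train)
            for name, s in samplers.items()}
```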
The novelty of our approach is summarized by the following points: (i) a multistage ensemble model that combines autoencoders for feature selection with a deep CNN for classification; (ii) fraud detection performed over the exposed latent data representation instead of the initial data; and (iii) a comparative assessment of multiple oversampling techniques (SMOTE variants and ADASYN) to mitigate class imbalance.
The rest of the paper is organized as follows. Section 2 reports on the related work
while Sect. 3 describes the problem under consideration. Section 4 discusses the
proposed approach and Sect. 5 analytically presents our DML models. In Sect. 6, we
present the experimental evaluation of the proposed approach, while in Sect. 7, we compare our results with the results reported in two other recent papers. Lastly, in Sect. 8, we conclude our paper by exposing our future research
plans.
2 Related work
Many techniques have been applied to maximize the detection rate of frauds through
the adoption of DML/ML techniques. The interested reader can find a relevant
survey in [10]. ML models involve neural networks (NNs), decision trees, genetic
algorithms, while outlier detection techniques can also be adopted for the identi‑
fication of frauds [10]. The adoption of the aforementioned ML schemes requires
the modelling of the environment and the solution space as well as a training phase
(it is the usual scenario for the majority of models). In [11], the authors present an
experimental comparison of various classification algorithms such as random for‑
ests and gradient boosting classifiers for unbalanced scoring datasets. The presented
outcome depicts that random forests and gradient boosting algorithms outperform
the remaining models involved in the comparison (e.g. C4.5, quadratic discriminant
and k-nearest neighbours—kNNs). However, the complexity of these approaches
may jeopardize the 'visibility' of the internal processes and lead to considering them as 'black boxes'. In [12], the authors conclude that SVMs improve the accuracy of event detection compared to logistic regression, linear discriminant analysis
and kNNs. A survey on SVMs introduces the application of the technology and
the techniques adopted to predict defaults using broad and narrow definitions [13].
In any case, SVMs are not suitable for the management of large data sets and do
not perform well when noise is present in data (e.g. overlapping classes). Another
effort presented in [14] tries to evaluate ML models (SVMs, bagging, boosting and
random forests) to predict bankruptcy one year prior to the event. The authors also
compare the performance of the adopted algorithms with results retrieved by dis‑
criminant analysis, logistic regression and NNs. The aforementioned attempt evalu‑
ates the strength of ensemble models over single classifiers focusing on the accuracy
of the outcomes. However, bagging may suffer from high bias if it is not modelled properly, leading to underfitting, while becoming computationally expensive when large-scale data is the case. Boosting usually cannot be implemented in real time due to its increased complexity and may involve multiple parameters having direct effects on the behaviour of the model. The authors of [15] proposed the PrecisionRank and the total detection cost as the appropriate metrics for measuring the detec‑
tion performance in credit datasets. In an additional effort presented in [16], the
authors focus on an effective learning strategy for addressing the verification latency
and the alert–feedback interaction problem, while they propose a formalization of
the fraud detection problem that realistically describes the operating conditions of fraud detection systems (FDSs) that analyse massive streams of credit card transactions every day. A denoising
autoencoder for credit risk analysis has been introduced to remove the noise from
the dataset [17]. Denoising autoencoders often yield better representations when
trained on corrupted versions of a dataset; thus, they can capture information and fil‑
ter out noise more effectively than traditional methods [17]. A deep autoencoder and
a restricted Boltzmann machine (RBM) that can reconstruct normal transactions to
finally find anomalies have been applied to a credit card dataset for fraud detection
[1]. The authors conclude that the combination of the autoencoder with the RBM
outperforms other techniques when the training dataset is large enough to train them
efficiently [1]. Sparse autoencoders and generative adversarial networks (GANs)
have been also adopted to detect potential frauds [18]. The discussed models can
achieve higher performance than other state-of-the-art one-class methods such as
one-class Gaussian process (GP) and support vector data description (SVDD). In
general, autoencoders may be somehow limited in the processes they can perform.
One potential use may be the pre-training of a model to get the dataset latent repre‑
sentation and isolate the most significant features. This means that for concluding
a classification process, autoencoders should be combined with other schemes. In
[19], the authors introduce a hybrid ‘Relief-CNN’ model, i.e. a combination of the
13
14576 G. Zioviris et al.
following techniques: a CNN and the Relief algorithm. The Relief algorithm is a
filter-method approach to feature selection that is notably sensitive to feature interactions. The algorithm calculates a score for each feature, which can then be used to rank the features and select the top-scoring ones. The utilization of the Relief algorithm can efficiently reduce the size of an image pixel matrix, which in turn reduces the computational burden of the CNN. The authors in [20] expand the
labelled data through their social relations to get the unlabelled data and propose
a semi-supervised attentive graph neural network, named SemiGNN, to utilize the
multi-view labelled and unlabelled data for fraud detection. Moreover, they propose
a hierarchical attention mechanism to better correlate different neighbours and dif‑
ferent views. In [21] the authors propose a method to combine label propagation
and transductive support vector machine (TSVM) with Dempster–Shafer theory for
accurate default prediction of social lending using unlabelled data. In order to train on a lot of data effectively, they ensemble semi-supervised learning methods with different characteristics. Label propagation is performed so that data having similar features are assigned to the same class, while TSVM pushes apart data having different features. Lastly, the authors of [22] use various machine learning algorithms,
with and without the usage of the AdaBoost and majority voting algorithm, in order
to detect fraudulent transactions.
3 Problem definition
^6 Basel Committee (2006) International Convergence of Capital Measurement and Capital Standards: A Revised Framework, Comprehensive Version. Bank for International Settlements, Switzerland.
We consider a stream Q of incoming transactions, where each transaction Ti is depicted by a vector incorporating its details, i.e. Ti = ⟨xi1 , xi2 , … , xiM ⟩ ( M is the number of the envisioned features). Our model incorporates two parts: (i) the monitoring part, which is responsible for receiving transactions from Q and getting their features' values, transferring them to the second part of the proposed model. The monitoring activity targets to prepare the incoming data for further processing; (ii) the classification part, which is responsible for classifying every transaction as fraudulent or not upon historical data depicted by the set H = { Tj } , j ≤ i , a data structure where we store past, processed transactions to become the basis for behavioural pattern detection and, thus, fraud detection. Our goal is to classify every Ti as a fraudulent transaction or not. This means that after receiving Ti , we have to decide amongst the following:
• Decision D1. Taking into consideration the values of each variable in Ti and his‑
torical transactions, classify Ti as a fraudulent transaction iff Ti significantly devi‑
ates from the ‘normal’ transactional pattern;
• Decision D2. If Ti does not represent a significant deviation from the ‘normal’
transactional patterns, classify Ti as non-fraudulent.
Transactions T′ being part of P are past transactions that users performed. In the above-described process, we can identify two significant aspects that deal with the efficient definition of P and the adoption of a function z(⋅) that delivers the deviation of Ti from P . For
the former, we have to adopt any pattern extraction algorithm proposed in the rele‑
vant literature or rely on the proposed DML/ML models to improve the performance
(as we show later). In our case, P will be realized through the learning procedure that derives the 'shape' of the proposed DML NNs. This means that, through the provided training dataset, our DML NNs will conclude to be aligned with the hidden characteristics of the underlying data, thus realizing P and the corresponding class. The
significant point is that, by performing the discussed training and feeding the NNs with fraudulent transactions as well, our DML model is capable of learning the fraud patterns P′ .
Our target is to learn the distribution of both P and P′ , then, to be capable of detect‑
ing similarities upon them and classify Ti as fraudulent or not. For the latter, z(⋅) is,
actually, a classification function that delivers the final decision. z(⋅) gets as inputs P
and P′ and Ti and performs a two-class classification process. The following equation holds true: z(P, P′, Ti) → {C, C̄}, where C = 1 represents a fraudulent transaction and C̄ = 0 a normal one. z(⋅) is realized by the last layers of the adopted CNN (see below for an
analysis of the proposed convolutional network).
We have to notice that the aforementioned process is performed after a dimen‑
sionality reduction phase to reveal the most significant parameters of H . Evidently,
this phase will expose strong correlations between the features incorporated in the
envisioned transactions giving the opportunity to keep only the most significant
ones (obviously, these features are aligned with the needs of the discussed problem).
Our approach involves an autoencoder that builds upon H and results in a new subset of features, i.e. f (x1 , x2 , … , xM ) → { x1 , x2 , … , xM′ } with M′ ≤ M . f (⋅) is delivered by the aforementioned autoencoder and targets to learn the compressed or decompressed representation of the input data. Hence, f (⋅) can be adopted to reconstruct the initial data with a low error, giving the opportunity to focus only on the most important features.
4 Proposed solution
4.1 High‑level architecture
Fig. 2 High-level architecture of the second proposed approach using a deep variational autoencoder
The classification performance is mainly affected by false positive events, i.e. normal transactions that are classified to class C , and false negative events, i.e. frauds that are missed. Hence, the elimination of false negative events becomes the
motivation for the adoption of the ‘ensemble’ DML model that is capable of pro‑
cessing complex datasets as well. False negatives can be avoided if we enhance
the learning process of the hidden aspects of the adopted datasets. Compared with
other efforts in the respective literature, we go a step forward and identify the ‘hid‑
den’ aspects of fraud detection. We can discern the features that exhibit strong cor‑
relations with the remaining ones. For instance, we propose a more efficient dimen‑
sionality reduction technique via the deep autoencoder that could lead to more
accurate classification results.
For preparing the data before they are consumed by the aforementioned DML
models, we rely on a preprocessing phase. All features are scaled to the unity inter‑
val to ensure that no extreme values are present in the dataset. For the scaling action,
we adopt the min–max normalization technique. Then, following [5], we create new
instances of the minority class, while, in parallel, we reduce the relative share of the majority class (i.e. C̄ ) by implementing oversampling techniques (see the next subsection for more details). In general, oversampling focuses on the enhancement of the minority class (i.e. C) to adjust the class distribution of our dataset, thus, to achieve
better performance. The newly created dataset is adopted to train the autoencoder.
For the experiments regarding the simple autoencoder, we adopt three (3) hidden
layers and the exponential linear unit (ELU) as the activation function, and gradu‑
ally, we reach ten (10) nodes (features) from the initial 30 ones of the dataset. We
consider ELU as it performs better than other activation functions. The activation
function ELU tends to drive the cost towards zero more quickly (it can produce negative
values allowing the network to push the mean activation closer to zero) than other
functions while being capable of producing more accurate results. In general, ELU
activation function decreases the gap between the normal gradient and the unit natural gradient, thereby speeding up the learning process [23]. For the
experiments that use a variational autoencoder, we use two (2) hidden layers and we gradually reduce the initial 30 features of the dataset to ten (10) nodes (features).
As the activation function, we adopt tanh in every layer relying on a set of simu‑
lations to reveal the performance of multiple activation functions and choose the
best one. Finally, we have to notice that the model loss is calculated with the binary
cross-entropy loss function. After the end of the training process, we get the new
feature-reduced dataset from the hidden layer and eliminate redundancies in the feature representation space. For instance, in a dataset with thirty (30) features, we can conclude in only ten (10), significantly reducing the data space upon which we
deliver the final classification outcome. The above processing increases the speed
of processing with a positive impact on our model as we target to support a streaming
environment where numerous transactions are collected. After the discussed step,
we create a new encoded (and reduced) feature dataset which is fed to the CNN. The CNN has an input layer, two (2) hidden layers and an output layer. In each layer,
we implement a pooling layer, batch normalization and the dropout method to avoid
overfitting. In the first two (2) layers, we adopt the ReLU activation function and for
the output layer the activation process is performed by a Sigmoid function. Our deci‑
sion for adopting the specific activation functions is concluded through an extensive
experimentation that reveals their performance for the specific problem. The model
loss is calculated with the binary cross-entropy loss function (cross-entropy minimi‑
zation is frequently used in optimization and rare event probability estimation). The
CNN's performance is then evaluated with the assistance of a separate test set. Code
and data for replicating our experiments are available at https://github.com/zioviris/
Credit_Card.
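The following Keras sketch condenses the two blocks of this pipeline: a three-hidden-layer ELU autoencoder compressing the 30 input features to a 10-feature latent representation, and a small CNN classifier with batch normalization, pooling and dropout per layer. Layer widths, kernel sizes and rates are not fully specified in the text and are therefore illustrative assumptions; the authors' actual implementation is the one in the repository above.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Autoencoder: 30 -> 20 -> 10 (latent) -> 20 -> 30, ELU activations and a
# binary cross-entropy loss, as described in the text; widths are assumed.
inp = keras.Input(shape=(30,))
h = layers.Dense(20, activation="elu")(inp)
latent = layers.Dense(10, activation="elu", name="latent")(h)
h = layers.Dense(20, activation="elu")(latent)
rec = layers.Dense(30, activation="sigmoid")(h)
autoencoder = keras.Model(inp, rec)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
encoder = keras.Model(inp, latent)   # emits the reduced 10-feature dataset

# CNN over the 10 encoded features treated as a 1-D signal: two hidden
# blocks with ReLU, pooling, batch normalization and dropout, then a
# sigmoid output; binary cross-entropy matches the text.
cnn = keras.Sequential([
    keras.Input(shape=(10, 1)),
    layers.Conv1D(32, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(2),
    layers.Dropout(0.2),
    layers.Conv1D(64, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(2),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

After training the autoencoder on the oversampled, scaled data, `encoder.predict(X)` yields the reduced dataset, which is reshaped to (samples, 10, 1) before being fed to the CNN.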
K-means SMOTE avoids the generation of noise and effectively overcomes imbalances between and within classes by employing the k-means clustering algorithm in combination with SMOTE oversampling [8]. In the Borderline SMOTE technique,
for every minority instance, its k nearest neighbours of the same class are extracted
and some of them are randomly selected based on the oversampling rate [6]. In SVM
SMOTE technique, the method first preprocesses the data by oversampling the minor‑
ity instances in the feature space, then the pre-images of the synthetic samples are
found based on a distance relation between feature space and input space. Finally, these
pre-images are appended to the original dataset [7]. Finally, the last technique that is
adopted in this paper is ADASYN [9]. ADASYN is based on the idea of using a weighted distribution for the instances of C according to their level of difficulty in learning. More synthetic data are generated for the instances of C that are harder to learn than for those that are easier to learn. ADASYN improves the learning ability w.r.t. data distributions and reduces the biases introduced by the class imbalance problem. The final target is to
adaptively shift the classification decision boundary towards the space of the ‘difficult’
instances.
5 Theory
Autoencoders are NNs that are, traditionally, used for dimensionality reduction, targeting data representations that may improve the performance while consuming less memory and run time. Autoencoders are trained to be capable of reproducing their inputs at their outputs. They adopt a hidden layer h trained to depict the provided input. An autoencoder may be viewed as a model containing two parts: (i) the encoder function f(x) (x is the input into the autoencoder) and (ii) a decoder scheme g(h) that produces a reconstruction of the initial input. The following equations hold true:

h = f(x)  (1)

x′ = g(h)  (2)

x′ = g(f(x))  (3)
The encoder function, denoted by f(x), maps the original data x to the latent space h which is present at the hidden layer, before they are reproduced by the decoder. The decoder function, denoted by g(h), maps the latent space h at the hidden layer to the output, which targets to be the same as the input. The discussed encoding network can
be represented by a standard NN function transferred through an activation func‑
tion, where l is the latent representation, i.e. (W and b are the weights and biases of the
layers),
l = 𝜎(Wx + b) (4)
Similarly, the decoder can be depicted in the same way, however, with different weights, biases and potentially activation functions. The decoding phase can be represented by the following equation ( W′ and b′ are the weights and biases of the hidden layer):

x′ = 𝜎′(W′l + b′)  (5)
In the proposed autoencoder, we adopt a loss function written in terms of the aforementioned functions. The loss function is utilized to train the NN through the backpropagation algorithm. Through backpropagation, our model continuously seeks to limit the error between the calculated values and the target outputs. This forces the hidden layer to perform dimensionality reduction and eliminate noise while reconstructing the inputs, especially when the number of neurons in the hidden layer is low. The following equation holds true:

L(x, x′) = ‖x − x′‖² = ‖x − 𝜎′(W′(𝜎(Wx + b)) + b′)‖²  (6)
For the variational autoencoder, the loss function is the negative evidence lower bound:

L(𝜃, 𝜙; x) = DKL(q𝜙(h|x) ‖ p𝜃(h)) − E q𝜙(h|x)[log p𝜃(x|h)]  (7)

In the above equation, DKL depicts the Kullback–Leibler divergence. The prior over the latent features is usually set to be the centred isotropic multivariate Gaussian, i.e.

p𝜃(h) = N(0, I)  (8)

Commonly, the shapes of the variational and the likelihood distributions are chosen such that they are factorized Gaussian distributions:

q𝜙(h|x) = N(𝜇𝜙(x), diag(𝜎𝜙²(x)))  (9)

p𝜃(x|h) = N(𝜇𝜃(h), diag(𝜎𝜃²(h)))  (10)
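Under the factorized-Gaussian choices of Eqs. (8)–(10), the KL term of Eq. (7) has a closed form, which is how VAE losses are typically implemented. The sketch below is a minimal TensorFlow rendering; the use of binary cross-entropy as the reconstruction term mirrors the simple autoencoder above and is an assumption, not a statement of the authors' exact code.

```python
import tensorflow as tf

def reparameterize(mu, log_var):
    """Sample h = mu + sigma * eps, eps ~ N(0, I), so gradients can flow
    through the stochastic latent layer (reparameterization trick)."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO of Eq. (7): per-sample reconstruction error plus the
    closed-form KL divergence between N(mu, sigma^2 I) and the N(0, I) prior."""
    recon = tf.keras.losses.binary_crossentropy(x, x_hat)          # (batch,)
    kl = -0.5 * tf.reduce_sum(
        1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)  # (batch,)
    # the relative weighting of the two terms is a modelling choice
    return tf.reduce_mean(recon + kl)
```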
CNNs were first proposed in [24] as the technology adopted to manage data as a
‘picture’ in the two-dimensional space. CNNs have been applied in various domains
like image recognition and classification, object detection, video recognition, recommender systems, medical image analysis, natural language pro‑
cessing (NLP) and financial time series. A CNN employs a special operation named
convolution which is a specialized type of linear processing instead of a generic
matrix multiplication, in at least one of its layers. In a CNN, the hidden layers involve a set of convolutional operations realized through multiplication, other dot products or cross-correlation calculations. This is significant for the indices within the data matrix, as it affects how weights are set at a selected index point^5. ReLU is a
widely adopted activation function, followed by additional layers like pooling layers, fully connected layers and normalization layers. In general, convolution is an
operation described by a smoothed estimate of the input function x(t),
s(t) = ∫ x(a) w(t − a) da  (11)
where w(a) is the kernel function in the form of a valid probability density function
and s(t) is the output. The convolutional operation is usually depicted by an asterisk:
s(t) = (x ∗ w)(t)  (12)
Alternatively, we can define the discrete convolution as the following equation
depicts:
s(t) = (x ∗ w)(t) = ∑_a x(a) w(t − a)  (13)
Usually, we use convolutions over multiple axes at a time. So, the above functions can be expressed as:

S(i, j) = (I ∗ K)(i, j) = ∑_m ∑_n I(m, n) K(i − m, j − n)  (14)
where K and I are the used kernel and the provided input, respectively. Equivalently, we can exploit the commutativity of the convolution operation:

S(i, j) = (K ∗ I)(i, j) = ∑_m ∑_n I(i − m, j − n) K(m, n)  (15)
The motivation behind the use of such a technique is that convolution leverages
three important ideas that can help improve a machine learning system, i.e. (i) sparse
interactions; (ii) parameter sharing; and (iii) equivariant representations^5.
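To make Eq. (14) concrete, the sketch below evaluates the double sum directly at the 'valid' output positions (a naive illustration; deep learning libraries usually compute the cross-correlation variant, i.e. Eq. (14) without flipping the kernel):

```python
import numpy as np

def conv2d_valid(I, K):
    """Direct evaluation of Eq. (14): out[i, j] equals
    S(i + kh - 1, j + kw - 1), i.e. the 'valid' part of the full convolution."""
    kh, kw = K.shape
    Kf = K[::-1, ::-1]  # flip the kernel so the sliding dot product is a true convolution
    out = np.empty((I.shape[0] - kh + 1, I.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return out

# Matches scipy.signal.convolve2d(I, K, mode="valid") on the same inputs.
```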
6 Results and discussion
We report on the evaluation of the proposed model upon a real dataset. Our dataset contains credit card transactions made in September 2013 by European cardhold‑
ers collected during a research collaboration of Worldline and the Machine Learn‑
ing Group (http://mlg.ulb.ac.be) of Université Libre de Bruxelles on big data min‑
ing and fraud detection [15, 16, 25–29]. The adopted dataset depicts transactions that occurred over two (2) days, where 492 fraudulent out of 284,807 transactions are present. The dataset is highly unbalanced: the C class (frauds) accounts for 0.172% of all the available transactions. It contains only numerical features fed into our encoder.
The feature ‘Class’ in the dataset is the classification outcome that is realized equal
to unity ( C = 1 ) in case of a fraud and zero, otherwise. In our experimentation, we
perform feature scaling applying the min–max normalization (i.e. we subtract the minimum of every feature, then divide by the range of the feature).
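A minimal sketch of this scaling step (equivalent to scikit-learn's MinMaxScaler fitted on the training split):

```python
import numpy as np

def min_max_scale(X):
    """Scale every feature (column) to [0, 1]: (x - min) / (max - min)."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)  # guard against constant features
    return (X - mn) / rng
```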
Six performance metrics are adopted to evaluate our model, i.e. accuracy ( 𝛼 ), pre‑
cision ( 𝜖 ), recall ( 𝜁 ), the Matthews correlation coefficient or MCC ( 𝜇 ), the F1-score
( 𝛿 ) and the area under curve or simply (AUC). 𝜖 is the fraction of true events (i.e.
frauds) amongst all samples which are classified as frauds, while 𝜁 is the fraction
of frauds which have been classified correctly over the total amount of frauds. 𝛿 is
a performance metric that combines both, 𝜖 and 𝜁 . The Matthews Correlation Coef‑
ficient (MCC) is a machine learning measure which is used to check the balance
of binary (two-class) classifiers. It takes into account all the true and false values; that is why it is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. AUC provides an aggregate measure of perfor‑
mance across all possible classification thresholds. The area under the curve (often
referred to as simply the AUC) is equal to the probability that a classifier will rank
a randomly chosen positive instance higher than a randomly chosen negative one
(assuming ‘positive’ ranks higher than ‘negative’). It is also common to calculate the area under the ROC convex hull, since any point on the line segment between two prediction results can be achieved by randomly using one or the other system with appropriately chosen probabilities.
𝛼 = (TP + TN) / (TP + TN + FP + FN)  (16)

𝜖 = TP / (TP + FP)  (17)

𝜁 = TP / (TP + FN)  (18)

𝜇 = (TP ⋅ TN − FP ⋅ FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))  (19)

𝛿 = 2 ⋅ (𝜖 ⋅ 𝜁) / (𝜖 + 𝜁)  (20)
In the above equations, TP (true positive) is the number of frauds which have been
classified correctly. FP (false positive) is the number of normal transactions which
have been classified as frauds. FN (false negative) is the number of frauds which
have been classified as normal ones. TN (true negative) is the number of normal
transactions that have been classified as normal.
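The paper does not state how the metrics are computed; a sketch with the standard scikit-learn equivalents, assuming hard predictions y_pred and classifier scores y_score, would be:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score)

def evaluate(y_true, y_pred, y_score):
    """The six metrics of this section; y_score (e.g. the CNN's sigmoid
    output) is needed for AUC, the other five use the hard labels."""
    return {
        "alpha (accuracy)":    accuracy_score(y_true, y_pred),
        "epsilon (precision)": precision_score(y_true, y_pred),
        "zeta (recall)":       recall_score(y_true, y_pred),
        "delta (F1)":          f1_score(y_true, y_pred),
        "mu (MCC)":            matthews_corrcoef(y_true, y_pred),
        "AUC":                 roc_auc_score(y_true, y_score),
    }
```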
6.2 Performance assessment
We evaluate our models through m-fold cross-validation: the dataset is split into m partitions, with a different partition used for testing each time. In our experiments, we use m ∈ {3, 4, 5, 10} . Our results are shown in Tables 1, 2, 3 and 4, respectively.
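A sketch of this protocol; the use of stratification, which preserves the tiny fraud ratio in every fold, is an assumption the text does not state explicitly:

```python
from sklearn.model_selection import StratifiedKFold

def m_fold_splits(X, y, m, seed=42):
    """Yield (train_idx, test_idx) pairs: each of the m partitions
    serves exactly once as the test set."""
    skf = StratifiedKFold(n_splits=m, shuffle=True, random_state=seed)
    yield from skf.split(X, y)

# for m in (3, 4, 5, 10): evaluate the pipeline on m_fold_splits(X, y, m)
```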
In the following figures, we present our experimental evaluation outcomes for
combinations of the main aforementioned models, i.e. AE-CNN and VAE-CNN.
In this set of experiments, we consider m ∈ {4} . In Fig. 3, we observe the perfor‑
mance of the AE-CNN core, for m = 4 . In Fig. 4, we observe the performance of
the VAE-CNN core, for m = 4.
6.3 Results
When we adopt an m = 3 fold cross-validation, the best model regarding 𝜁 (i.e. recall,
which represents the number of fraudulent transactions classified correctly) is the
combination of ADASYN-VAE-CNN 1 ( 𝜁 = 93.50% ). Nevertheless, in terms of 𝜖 ,
𝛿 and 𝜇 , the discussed model exhibits the worst performance. A low 𝜖 implies a large number of normal transactions classified as fraudulent: in absolute terms, 1,300 normal transactions are classified as fraudulent out of a number of 80,000 transactions in total. The best model, in terms of 𝜖 , 𝛼 , 𝜇 and 𝛿 , is the scheme adopt‑
ing a single CNN 1 ( 𝜖 = 90.50% , 𝛼 = 99.97%, 𝜇 = 83.60% and 𝛿 = 83.90% ). How‑
ever, the second best model in terms of 𝜁 is the K-means SMOTE-AE-CNN model 1
( 𝜁 = 77.44% ). The important thing about the model is that it performed adequately
in terms of 𝜖 , 𝛿 and 𝜇 , in contrast to the aforementioned ADASYN-VAE-CNN.
When m = 4 , the best model regarding 𝜁 is the ADASYN-VAE-CNN model 2
( 𝜁 = 93.70% ). However, in terms of 𝜖 and 𝛿 , it exhibits the worst performance. The
best model in terms of 𝜖 , 𝛼 , 𝜇 and 𝛿 is the CNN 2 model ( 𝜖 = 91.20% , 𝛼 = 99.98% ,
𝜇 = 83.49% and 𝛿 = 84.49% ). An interesting result is that the AE-CNN 2 model has
the second best performance in terms of 𝜁 ( 𝜁 = 79.27% ) outperforming the remain‑
ing models except the aforementioned ADASYN-VAE-CNN model.
Fig. 3 Performance of AE-CNN for m = 4 (up: AE-CNN model performance; bottom: AE-CNN ROC
curve)
The important thing about the model is that it performed adequately in terms of 𝜖 , 𝛿 and 𝜇 , in contrast to the aforementioned ADASYN-VAE-CNN.
With m = 5 , the best model regarding 𝜁 is the ADASYN-VAE-CNN 3 combina‑
tion ( 𝜁 = 93.90% ). The best model in terms of 𝜖 , 𝜇 , 𝛼 and 𝛿 is the individual CNN 3
( 𝜖 = 91.01% , 𝛼 = 99.99% , 𝜇 = 84.50% , 𝛿 = 84.15% ). Interestingly, the VAE-CNN 3 combination has the second best performance in terms of 𝜁 ( 𝜁 = 78.25% ) while, at the same time, it performed adequately in terms of 𝜖 , 𝛿 and 𝜇 , in contrast to the aforementioned ADASYN-VAE-CNN.
With m = 10 , the best performing model regarding 𝜁 is the ADASYN-VAE-CNN 4 combination ( 𝜁 = 94.11% ). In terms of 𝜖 , 𝜇 and 𝛿 , the CNN 4 model
has the best performance ( 𝜖 = 91.27% , 𝜇 = 83.70% and 𝛿 = 84.47% ). An interesting
result is that the VAE-CNN 4 model has the second best performance in terms of recall,
i.e. 𝜁 = 78.86% , outperforming the remaining models (except the ADASYN-VAE-CNN models as mentioned above) while, at the same time, it performed adequately in terms of 𝜖 , 𝛿 and 𝜇 , in contrast to the aforementioned ADASYN-VAE-CNN.
Fig. 4 Performance of VAE-CNN Model for m = 4 (up: VAE-CNN model performance; bottom: VAE-
CNN ROC curve)
6.4 Discussion
In this paper, the proposed method targets to perform better than conventional ML methods in terms of recall, the metric that shows the fraction of transactions that have been classified correctly as fraud over the total amount of frauds. Keeping that in
mind, the ADASYN-VAE-CNN model outperforms the rest of the models by far
(i.e. Average recall = 93.80% approximately). While recall is important, we cannot
overlook the poor performance of the model in terms of precision (Average Preci‑
sion = 1.25% approximately), which leads of course to a poor performance in terms
of F1-score and MCC. Even then, in absolute terms, 1,300 normal transactions were classified as fraudulent out of a number of 80,000 transactions in total.
7 Comparative assessment
In this section, we compare our models with the results that the authors of [1] and [22] present in their work; since the same dataset is used, the comparison is straightforward. In the first paper, the authors use a deep autoencoder and
an RBM model in order to outperform other techniques. In this study, the authors use the hyperbolic tangent ('Tanh') function to encode and decode the input to the output. Having trained the AE model, they compute the reconstruction error by using backpropagation. Backpropagation computes the 'error signal' and propagates it backwards through the network, starting at the output units, where the error is the difference between the actual and desired output values. Based on the AE, they use parameter gradients for realizing backpropagation. They use three (3) encoding and three (3) decoding hidden layers, each adopting the 'Tanh' activation function, while they split the data into 80% for training and 20% for testing, using normal transactions to predict fraudulent transactions. Another algorithm
that the authors use is the RBM. There are two structures in this algorithm: the visible (or input) layer and the hidden layer. Each input node takes an input feature from the dataset to be learned. The design is different from other deep learning models because there is no output layer; the output of the RBM is the reconstruction fed back to the input. The point of the RBM is the way in which it learns by itself to reconstruct the data. The only metric that is used for comparison is the AUC score.
In [22], the authors use various machine learning algorithms, including Naive Bayes, SVM, Decision Tree, Random Forest, Gradient Boosted Tree, Decision Stump and MLP, with and without the use of the AdaBoost and majority voting algorithms, with the aim of detecting fraudulent transactions. The Matthews correlation coefficient (MCC) and accuracy are the metrics used for comparison. Neither of those papers provides sufficient information about precision, recall and F1-score. In addition, neither of them reports anything about the usage of k-fold cross-validation. In every comparison, we report the metrics of our models for the number of folds that gives the best results, following the previous tables.
From this comparison, we can see that the majority of our models perform better than the models of the authors of [1]. The best of our models achieves an AUC score of 99.90%, while the best model in [1] has a score of 96.03%. Once again, we state that this paper did not provide sufficient information about precision, recall, F1-score or any other metric besides the AUC score.
Similarly, we can see that the majority of our models with the deep autoencoder perform better than the models of the authors of [22]. Across the three comparisons, the best of our models achieves a score of 99.94% in terms of accuracy ( 𝛼 ) and a score of 82.67% in terms of MCC ( 𝜇 ), while the best models in [22] reach accuracies between 99.93% and 99.94% and MCC scores of 81.30%, 80.70% and 82.30%, respectively. Once again, we state that this paper did not provide sufficient information about precision, recall, F1-score or any other metric.
8 Conclusions
Author Contributions All authors contributed to the study conception and design. All authors read and
approved the final manuscript.
Funding The authors declare that no funds, grants, or other support were received during the preparation
of this manuscript.
Data Availability Statement The datasets generated and analysed during the current study are available in the https://www.kaggle.com/kartik2112/fraud-detection-banksim/data repository.
Declarations
Conflict of interest The authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethical approval Authors confirm that the appropriate ethics review has been followed.
References
1. Pumsirirat A, Yan L (2018) Credit card fraud detection using deep learning based on auto-
encoder and restricted Boltzmann machine. Int J Adv Comput Sci Appl 9(1):18–25. https://doi.
org/10.14569/IJACSA.2018.090103 (ISSN 21565570)
2. Pascal V, Hugo L, Isabelle L, Yoshua B, Pierre-Antoine M (2010) Stacked denoising autoencod‑
ers: learning useful representations in a deep network with a local denoising criterion. J Mach
Learn Res 11(110):3371–3408
3. Valueva MV, Nagornov NN, Lyakhov PA, Valuev GV, Chervyakov NI (2020) Application of the
residue number system to reduce hardware costs of the convolutional neural network implemen‑
tation. Math Comput Simul 177:232–243. https://doi.org/10.1016/j.matcom.2020.04.031
4. Dupond S (2019) A thorough review on the current advance of neural network structures. Annu
Rev Control 14:200–230
5. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357. https://doi.org/10.1613/jair.953 (ISSN 10769757)
6. Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbal‑
anced data sets learning. Adv Intell Comput 3644:878–887. https://doi.org/10.1007/11538059_91
7. Zeng ZQ, Gao J (2009) Improving SVM Classification with Imbalance Data Set. In: Leung CS, Lee
M, Chan JH (eds) Neural Information Processing. ICONIP 2009. Lecture Notes in Computer Sci‑
ence, vol 5863. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10677-4_44
8. Last F, Douzas G, Bação F (2017) Oversampling for imbalanced learning based on K-Means and
SMOTE
9. He H, Bai Y, Garcia E, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the International Joint Conference on Neural Networks, pp 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
10. Prasad NR, Almanza-Garcia S, Lu TT (2009) Anomaly detection. Comput Mater Contin 14(1):1–
22. https://doi.org/10.1145/1541880.1541882 (ISSN 15462218)
11. Iain B, Christophe M (2012) An experimental comparison of classification algorithms for imbal‑
anced credit scoring data sets. Expert Syst Appl 39(3):3446–3453. https://doi.org/10.1016/j.eswa.
2011.09.033
12. Bellotti T, Crook J (2009) Support vector machines for credit scoring and discovery of significant
features. Expert Syst Appl 36(2 PART 2):3302–3308. https://doi.org/10.1016/j.eswa.2008.01.005
(ISSN 09574174)
13. Harris T (2013) Quantitative credit risk assessment using support vector machines: Broad versus
Narrow default definitions. Expert Syst Appl 40(11):4404–4413. https://doi.org/10.1016/j.eswa.
2013.01.044 (ISSN 09574174)
14. Barboza F, Kimura H, Altman E (2017) Machine learning models and bankruptcy prediction.
Expert Syst Appl 83:405–417. https://doi.org/10.1016/j.eswa.2017.04.006 (ISSN 09574174)
15. Dal Pozzolo A, Caelen O, Borgne Y, Waterschoot S, Bontempi G (2014) Learned lessons in credit
card fraud detection from a practitioner perspective. Expert Syst Appl 41(10):4915–4928. https://
doi.org/10.1016/j.eswa.2014.02.026 (ISSN 09574174)
16. Dal Pozzolo A, Boracchi G, Caelen O, Alippi C, Bontempi G (2018) Credit card fraud detection: a
realistic modeling and a novel learning strategy. IEEE Trans Neural Netw Learn Syst 29(8):3784–
3797. https://doi.org/10.1109/TNNLS.2017.2736643
17. Fan Q, Yang J (2018) A Denoising Autoencoder Approach for Credit Risk Analysis. https://doi.org/
10.1145/3194452.3194456
18. Chen J, Shen Y, Ali R (2019) Credit Card Fraud Detection Using Sparse Autoencoder and Genera‑
tive Adversarial Network. 2018 IEEE 9th Annual Information Technology, Electronics and Mobile
Communication Conference, IEMCON 2018, (May): 1054–1059. https://doi.org/10.1109/IEMCON.
2018.8614815.
19. Zhu B, Yang W, Wang H, Yuan Y (2018) A hybrid deep learning model for consumer credit scoring.
2018 International Conference on Artificial Intelligence and Big Data, ICAIBD 2018, (May):205–
208. https://doi.org/10.1109/ICAIBD.2018.8396195
20. Wang D et al (2019) “A Semi-Supervised Graph Attentive Network for Financial Fraud Detection,”
2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, pp 598-607, https://
doi.org/10.1109/ICDM.2019.00070
21. Kim A, Cho SB (2019) An ensemble semi-supervised learning method for predicting defaults in
social lending. Eng Appl Artif Intel 81:193–199. https://doi.org/10.1016/j.engappai.2019.02.014
22. Randhawa K, Loo CK, Seera M, Lim CP, Nandi AK (2018) Credit card fraud detection using Ada‑
Boost and majority voting. IEEE Access 6:14277–14284. https://doi.org/10.1109/ACCESS.2018.
2806420
23. Clevert DA, Unterthiner T, Hochreiter S (2016) Fast and accurate deep network learning by exponential linear units (ELUs). In: 4th International Conference on Learning Representations, ICLR 2016, Conference Track Proceedings, pp 1–14
24. Lecun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
25. Dal Pozzolo A (2015) Adaptive machine learning for credit card fraud detection. PhD thesis, Université Libre de Bruxelles. URL http://www.ulb.ac.be/di/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf
26. Dal Pozzolo A, Caelen O, Johnson RA, Bontempi G (2015) Calibrating probability with undersam‑
pling for unbalanced classification. Proceedings—2015 IEEE Symposium Series on Computational
Intelligence, SSCI 2015, (November): 159–166. https://doi.org/10.1109/SSCI.2015.33
27. Carcillo F, Dal Pozzolo A, Borgne YL, Caelen O, Mazzer Y, Bontempi G (2018) SCARFF: a
scalable framework for streaming credit card fraud detection with spark. Inf Fusion 41:182–194. https://doi.org/10.1016/j.inffus.2017.09.005 (ISSN 15662535)
28. Sperduti A, Navarin N, Oneto L (2020) Recent Advances in Big Data and Deep Learning. Proceed‑
ings of the International Neural Networks Society. https://doi.org/10.1007/978-3-030-16841-4
29. Lebichot B, Le Borgne YA, He-Guelton L, Oblé F, Bontempi G (2020) Deep-learning domain adaptation techniques for credit card fraud detection. In: Recent Advances in Big Data and Deep Learning, Proceedings of the International Neural Networks Society, pp 78–88. https://doi.org/10.1007/978-3-030-16841-4
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.