
The Journal of Supercomputing (2022) 78:14571–14596
https://doi.org/10.1007/s11227-022-04465-9

Credit card fraud detection using a deep learning multistage model

Georgios Zioviris1 · Kostas Kolomvatsos2 · George Stamoulis1

Accepted: 12 March 2022 / Published online: 6 April 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
The banking sector is on the eve of a serious transformation and the thrust behind it is artificial intelligence (AI). Novel AI applications have already been proposed to deal with challenges in the areas of credit scoring, risk assessment, client experience and portfolio management. One of the most critical challenges in the aforementioned sector is fraud detection upon streams of transactions. Recently, deep learning models have been introduced to deal with the specific problem in terms of detecting and forecasting possible fraudulent events. The aim is to estimate the unknown distribution of normal/fraudulent transactions and then to identify deviations that may indicate a potential fraud. In this paper, we elaborate on a novel multistage deep learning model that targets to efficiently manage the incoming streams of transactions and detect the fraudulent ones. We propose the use of two autoencoders to perform feature selection and learn the latent data space representation based on a nonlinear optimization model. On the delivered significant features, we subsequently apply a deep convolutional neural network to detect frauds, thus combining two different processing blocks. The adopted combination has the goal of detecting frauds over the exposed latent data representation and not over the initial data.

Keywords  Fraud detection · Autoencoder (AE) · Variational autoencoder (VAE) · Convolutional neural network (CNN) · Dimensionality reduction · Oversampling techniques

* Georgios Zioviris
[email protected]
Kostas Kolomvatsos
[email protected]
George Stamoulis
[email protected]
1 Department of Electrical and Computer Engineering, University of Thessaly, Glavani 37, 38221 Volos, Greece
2 Department of Informatics and Telecommunications, University of Thessaly, Papasiopoulou 2-4, 35131 Lamia, Greece


1 Introduction

Recent studies reveal that the fraud detection market is estimated to be worth $19.5 billion¹. This amount will increase in the future, with estimates of $63 billion by 2023. According to the Consumer Sentinel Network of the USA, of the 3.2 million identity theft and fraud reports received in 2019, 1.7 million were fraud-related. Of the 1.7 million fraud cases, 23% reported that money was lost². Fraud management is a key research subject for multiple areas like insurance claims, money laundering, electronic payments, bank transactions, etc. Fraud should be detected in the minimum possible time upon streams that convey transactional data. Obviously, rich datasets can be formulated in the financial services sector, with high complexity and volume. The challenge for banking institutions then is to quickly detect and separate fraudulent transactions from legitimate ones with no impact on customer experience. Traditionally, the methods adopted for fraud detection are related to manual intervention or rule-based models with limited success³. Manual intervention/detection suffers from increased processing time, while rule-based approaches deal with complex sets of conditions that should be met before a transaction is characterized as suspicious. Additionally, in both scenarios (manual detection and rule-based systems), an increased effort is required to deliver the discussed conditions upon which a transaction may be characterized as fraudulent. Machine learning (ML) can be the solution to these problems, and especially deep ML (DML), which is capable of identifying more complex patterns upon huge volumes of data.
The application of ML in fraud detection may allow banking institutions to discern genuine transactions from fraudulent events in real time. Apart from the temporal aspect, ML can assist in the definition of models that exhibit higher accuracy than other schemes⁴. An interesting approach is to mix multiple types of ML models (e.g. supervised and unsupervised methods) to increase the 'detection' capability of the so-called ensemble scheme related to hidden aspects of the distribution, leading to the recognition of new patterns in fraud management. Evidently, any efficient learning methodology that reveals the hidden characteristics of data will lead to the recognition of new patterns in fraud management. In any case, the need for the discussed models, both simple and ensemble schemes, is imperative due to the financial impact of frauds. Apart from the detection accuracy, ML schemes can support automated methods for fraud detection; thus, the desired tasks that are manually executed will be realized in an automated manner, requiring less intervention by employees. The speed-up of processing will increase the throughput of financial institutions, making them capable of dealing with the exponentially growing domain of fraud detection and the relevant data.
¹ https://www.statista.com/statistics/786778/worldwide-fraud-detection-and-prevention-market-size
² https://www.ftc.gov/reports/consumer-sentinel-network-data-book-2019
³ https://interceptd.com/how-is-machine-learning-used-in-fraud-detection
⁴ https://www.netguru.com/blog/fraud-detection-with-machine-learning-banking

In this paper, we test a model that performs better than several machine learning methods in terms of recall, the metric that shows which of the transactions have been classified correctly as fraud over the total amount of frauds. A unique ensemble model is proposed, which combines a deep autoencoder that is responsible for the feature reduction of the dataset and a convolutional neural network that classifies the transactions as frauds or not. Our feature selection approach is adopted to reduce the computational burden of the upcoming classification process performed by our CNN. The monitoring mechanism observes huge volumes of data generated by electronic transactions. In other words, fraud is a special case of outlier detection. The highly imbalanced distribution of the classification classes (i.e. fraud or normal transaction) generates new challenges that may be met by adopting a feature selection methodology. The goal of the proposed approach is to detect the most appropriate (or stable) features that are not dependent on the dataset. However, to avoid selecting features in favour of the dominant class (as fraud events are far fewer than normal transactions), we rely on various oversampling techniques to train our model. In this paper, we focus on an 'adaptive' model capable of detecting new, complex fraud patterns through the use of advanced DML schemes. The adaptivity aspect of our approach is related to the underlying data and their efficient management, together with the support of streaming environments. Our scheme takes into consideration and reveals the latent data representation before it proceeds to the final decision. This is a critical aspect of our work, as we try to classify any incoming transaction based on the features that are significant for the observed data.
We provide a methodology that avoids the disadvantages of legacy tools. We rely on advanced DML to achieve the best possible performance. DML has been widely adopted in many domains like image processing, speech recognition and natural language processing (NLP) [1]. DML advantages are revealed through the use of, e.g. autoencoders [2], CNNs [3] and recurrent neural networks (RNNs) [4]. In recent years, deep learning has attracted tremendous growth and attention as the outcome of more powerful hardware, larger datasets and techniques to train deeper networks⁵. We build upon the increased accuracy of ML/DML that can support institutions in fraud detection through the elimination of false positive alerts, i.e. transactions that are incorrectly classified as frauds. Our model reduces false negatives as well, i.e. missed real fraud incidents. We also deal with the problem of unbalanced data in the training dataset. We adopt the synthetic minority oversampling technique (SMOTE) and its variants, utilizing the k-nearest neighbours (kNN) algorithm to identify minority classes in the training dataset and learn their features. SMOTE can produce a dataset that is balanced concerning the desired classes, ensuring that the ML/DML models fed with these data will be more resistant to overfitting.

⁵ Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
Nevertheless, to expose the performance and provide a comparative assessment of multiple oversampling techniques, we incorporate into our scheme the following models: (i) SMOTE [5], (ii) Borderline SMOTE (minority examples near the borderline are oversampled) [6], (iii) support vector machine (SVM) SMOTE (new minority class instances are generated near borderlines, with the use of SVM to assist in establishing the boundary between classes) [7], (iv) k-Means SMOTE (minority class instances are generated in safe and critical areas of the input space) [8], and (v) Adaptive Synthetic (ADASYN), which adopts a weighted distribution for different minority class instances based on their level of difficulty in the learning process [9].
The novelty of our approach is summarized in the following list.

• We provide a hybrid model that combines an autoencoder (deep and variational) for performing dimensionality reduction, responsible for the identification of the most significant features in the processed dataset, with a convolutional neural network that is responsible for the final classification process. The aim is to deal with scenarios where a high number of dimensions is present. The autoencoder efficiently learns the representation of the data under consideration and generates the reduced encoding as a representation as close as possible to the original input;
• We support the desired classification process with a CNN that determines whether a transaction is a fraud or not. The discussed classification process is performed upon the features detected by the proposed autoencoder. The proposed CNN adopts connectivity patterns that can learn the attributes of the data distribution and finally detect potential frauds;
• We provide an extensive experimental assessment to reveal the pros and cons of the proposed schemes in fraud detection. Our evaluation can be adopted as a comparative study upon the use of multiple DML models and oversampling techniques while performing their combination;
• We present an extensive comparison amongst five oversampling techniques while in use with our core model;
• Finally, we compare our model with a set of recently proposed schemes found in the respective literature. We adopt the same datasets and performance metrics as the aforementioned schemes in order to secure the fairness of the comparative assessment.

The rest of the paper is organized as follows. Section 2 reports on the related work, while Sect. 3 describes the problem under consideration. Section 4 discusses the proposed approach and Sect. 5 analytically presents our DML models. In Sect. 6, we present the envisioned experimental evaluation of the proposed approach, while in Sect. 7, we compare our results with the results presented in two other recent papers. Lastly, in Sect. 8, we conclude our paper by presenting our future research plans.

2 Related work

Many techniques have been applied to maximize the detection rate of frauds through the adoption of DML/ML techniques. The interested reader can find a relevant survey in [10]. ML models involve neural networks (NNs), decision trees and genetic algorithms, while outlier detection techniques can also be adopted for the identification of frauds [10]. The adoption of the aforementioned ML schemes requires the modelling of the environment and the solution space as well as a training phase

(it is the usual scenario for the majority of models). In [11], the authors present an experimental comparison of various classification algorithms, such as random forests and gradient boosting classifiers, for unbalanced scoring datasets. The presented outcome depicts that random forests and gradient boosting algorithms outperform the remaining models involved in the comparison (e.g. C4.5, quadratic discriminant analysis and k-nearest neighbours, kNNs). However, the complexity of these approaches may jeopardize the 'visibility' of the internal processes and lead to considering them as 'black boxes'. In [12], the authors conclude that SVMs improve the accuracy of event detection compared to logistic regression, linear discriminant analysis and kNNs. A survey on SVMs introduces the application of the technology and the techniques adopted to predict defaults using broad and narrow definitions [13]. In any case, SVMs are not suitable for the management of large datasets and do not perform well when noise is present in the data (e.g. overlapping classes). Another effort presented in [14] tries to evaluate ML models (SVMs, bagging, boosting and random forests) to predict bankruptcy one year prior to the event. The authors also compare the performance of the adopted algorithms with results retrieved by discriminant analysis, logistic regression and NNs. The aforementioned attempt evaluates the strength of ensemble models over single classifiers, focusing on the accuracy of the outcomes. However, bagging may suffer from high bias if it is not modelled properly, leading to underfitting, while becoming computationally expensive when large-scale data is the case. Boosting cannot, usually, be implemented in real time due to its increased complexity and may result in multiple parameters having direct effects on the behaviour of the model. The authors in [15] proposed the PrecisionRank and the total detection cost as the appropriate metrics for measuring the detection performance in credit datasets. In an additional effort presented in [16], the authors focus on an effective learning strategy for addressing the verification latency and the alert-feedback interaction problem, while they propose a formalization of the fraud detection problem that realistically describes the operating conditions of fraud detection systems (FDSs) that analyse massive streams of credit card transactions every day. A denoising autoencoder for credit risk analysis has been introduced to remove the noise from the dataset [17]. Denoising autoencoders often yield better representations when trained on corrupted versions of a dataset; thus, they can capture information and filter out noise more effectively than traditional methods [17]. A deep autoencoder and a restricted Boltzmann machine (RBM) that can reconstruct normal transactions to finally find anomalies have been applied to a credit card dataset for fraud detection [1]. The authors conclude that the combination of the autoencoder with the RBM outperforms other techniques when the training dataset is large enough to train them efficiently [1]. Sparse autoencoders and generative adversarial networks (GANs) have also been adopted to detect potential frauds [18]. The discussed models can achieve higher performance than other state-of-the-art one-class methods such as the one-class Gaussian process (GP) and support vector data description (SVDD). In general, autoencoders may be somewhat limited in the processes that they can perform. One potential use may be the pre-training of a model to get the dataset's latent representation and isolate the most significant features. This means that, for concluding a classification process, autoencoders should be combined with other schemes. In [19], the authors introduce a hybrid 'Relief-CNN' model, i.e. a combination of the following techniques: a CNN and the Relief algorithm. The Relief algorithm is a filter-method approach to feature selection that is notably sensitive to feature interactions. This algorithm calculates a score for each feature, which can then be applied to rank and select the top-scoring features. The utilization of the Relief algorithm can efficiently reduce the size of an image pixel matrix, which in turn reduces the computational burden of the CNN. The authors in [20] expand the labelled data through their social relations to get the unlabelled data and propose a semi-supervised attentive graph neural network, named SemiGNN, to utilize the multi-view labelled and unlabelled data for fraud detection. Moreover, they propose a hierarchical attention mechanism to better correlate different neighbours and different views. In [21], the authors propose a method to combine label propagation and a transductive support vector machine (TSVM) with Dempster-Shafer theory for accurate default prediction in social lending using unlabelled data. In order to train on a lot of data effectively, they ensemble semi-supervised learning methods with different characteristics. Label propagation is performed so that data having similar features are assigned to the same class, while the TSVM pushes away data having different features. Lastly, the authors of [22] use various machine learning algorithms, with and without the usage of the AdaBoost and majority voting algorithms, in order to detect fraudulent transactions.

3 Problem definition

Fraud, for a financial institution, is a concept adopted to describe an unauthorized transaction using a credit or a debit card. Fraudulent transactions may lead to the authorization of a financial activity while the genuine customers/users themselves process a payment to another account controlled by an unauthorized person. The unauthorized transaction is realized through a 'leakage' in interactions taking place between the customer/user and a service provider, or the unauthorized person can directly perform the payment using credit/debit card information without permission. According to the Basel Committee on Banking Supervision⁶, a fraud is assumed to have occurred when either or both of the two following events hold true: (i) the financial institution considers that the obligor is unlikely to pay its obligations in full, being unable to realize (sell) security (if held) in order to satisfy the obligor's debts; and (ii) the obligor is more than 90 days past due on any material credit obligation to the financial institution.

⁶ Basel Committee (2006). International Convergence of Capital Measurement and Capital Standards: A Revised Framework, Comprehensive Version. Switzerland: Bank for International Settlements.
The above discussion motivates us to pursue a technique that will immediately detect fraudulent events based on the behavioural patterns of users' 'normal' transactions. We focus on the streams of transactions coming from numerous sources and build a DML/ML system for the management of the relevant data.

Assume that the transactions stream is annotated by Q = {Ti}, where Ti is the ith transaction. Without loss of generality, we consider that Ti is a multivariate

vector incorporating the details of each transaction, i.e. Ti = ⟨xi1, xi2, …, xiM⟩ (M is the number of the envisioned features). Our model incorporates two parts: (i) the monitoring part, which is responsible for receiving transactions from Q, getting their features' values and transferring them to the second part of the proposed model. The monitoring activity targets to prepare the incoming data for further processing; (ii) the classification part, which is responsible for classifying every transaction as fraudulent or not upon historical data depicted by the set H. H = {Tj : j ≤ i} is a data structure where we store past, processed transactions to become the basis for behavioural pattern detection and, thus, fraud detection. Our goal is to classify every Ti as a fraudulent transaction or not. This means that after receiving Ti, we have to decide amongst the following:

• Decision D1. Taking into consideration the values of each variable in Ti and historical transactions, classify Ti as a fraudulent transaction iff Ti significantly deviates from the 'normal' transactional pattern;
• Decision D2. If Ti does not represent a significant deviation from the 'normal' transactional patterns, classify Ti as non-fraudulent.

As a 'normal' transactional pattern, we consider transactions that do not deviate from a sequence of recurring elements or events, where these events can be repeatable in a predictable manner for each customer/user. For instance, customers/users are characterized by a large set of recurring financial actions during their everyday activities. Upon such financial actions, we can conclude a pattern P which is a theme of the aforementioned events, i.e. P = {T′1, T′2, …}. In any case, transactions T′ being part of P are past transactions that users performed. In the above described process, we can identify two significant aspects that deal with the efficient definition of P and the adoption of a function z(⋅) that delivers the deviation of Ti from P. For the former, we can adopt any pattern extraction algorithm proposed in the relevant literature or rely on the proposed DML/ML models to improve the performance (as we show later). In our case, P will be realized through the learning procedure that derives the 'shape' of the proposed DML NNs. This means that, through the provided training dataset, our DML NNs will conclude to be aligned with the hidden characteristics of the underlying data, thus realizing P and the corresponding class. The significant point is that, by performing the discussed training and feeding the NNs with fraudulent transactions as well, our DML model is capable of learning fraud patterns P′. Our target is to learn the distribution of both P and P′, then to be capable of detecting similarities upon them and classifying Ti as fraudulent or not. For the latter, z(⋅) is, actually, a classification function that delivers the final decision. z(⋅) gets as inputs P, P′ and Ti and performs a two-class classification process. The following equation holds true: z(P, P′, Ti) → {C, C̄}, where C = 1 denotes a fraudulent transaction and C̄ = 0 a normal one. z(⋅) is realized by the last layers of the adopted CNN (see below for an analysis of the proposed convolutional network).
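As a schematic illustration (not the authors' code), z(⋅) ultimately reduces to thresholding the CNN's sigmoid output once the latent representation of a transaction has been computed; `encode` and `cnn_model` below are hypothetical handles to the trained components:

```python
def classify_transaction(cnn_model, encode, T_i, threshold=0.5):
    # z(.): map transaction T_i to class C (1, fraudulent) or C-bar (0, normal).
    latent = encode(T_i.reshape(1, -1))            # latent representation of T_i
    p_fraud = float(cnn_model.predict(latent)[0])  # affinity to the fraud patterns P'
    return 1 if p_fraud >= threshold else 0        # decision D1 vs. decision D2
```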
We have to notice that the aforementioned process is performed after a dimensionality reduction phase to reveal the most significant parameters of H. Evidently, this phase will expose strong correlations between the features incorporated in the envisioned transactions, giving the opportunity to keep only the most significant ones (obviously, these features are aligned with the needs of the discussed problem). Our approach involves an autoencoder that builds upon H and results in a new subset of features, i.e. f(x1, x2, …, xM) → {x1, x2, …, xM′} with M′ ≤ M. f(⋅) is delivered by the aforementioned autoencoder and targets to learn the compressed or decompressed representation of the input data. Hence, f(⋅) can be adopted to reconstruct the initial data with a low error, giving the opportunity to focus only on the most important features.

Fig. 1  High-level architecture of the first proposed approach, using a deep autoencoder

4 Proposed solution

4.1 High-level architecture

Trying to support a high-quality solution with increased efficiency, we propose the combination of two DML models, i.e. an autoencoder and a CNN, to conclude a system that performs fraud detection. The aforementioned DML models are connected in a 'sequential' manner (see Figs. 1 and 2). To the best of our knowledge, it is one of the first attempts to combine the discussed models to create a powerful mechanism for fraud detection. At first, our data become the subject of processing by the autoencoder to realize the envisioned dimensionality reduction and make our system capable of managing huge volumes of streaming data. Through the adoption of the autoencoder, we are able to reveal the statistical information of the data and expose strong correlations between features. The output of the hidden layer of the autoencoder (i.e. the compressed representation of the initial dataset) is transferred to the CNN. A second advanced data management step is performed before the final classification to classes C and C̄ takes place.
The combination of the two DML models targets to improve the classification recall, which depicts whether our approach can eliminate false negative events (i.e. the proportion of fraudulent transactions that are not detected). As we show later in this paper, the proposed model also exhibits a good performance related to the precision, which is mainly affected by false positive events, i.e. normal transactions that are classified to class C. Hence, the elimination of false negative events becomes the motivation for the adoption of the 'ensemble' DML model, which is capable of processing complex datasets as well. False negatives can be avoided if we enhance the learning process of the hidden aspects of the adopted datasets. Compared with other efforts in the respective literature, we go a step forward and identify the 'hidden' aspects of fraud detection. We can discern the features that exhibit strong correlations with the remaining ones. For instance, we propose a more efficient dimensionality reduction technique via the deep autoencoder that could lead to more accurate classification results.

Fig. 2  High-level architecture of the second proposed approach, using a deep variational autoencoder
For preparing the data before they are consumed by the aforementioned DML models, we rely on a preprocessing phase. All features are scaled to the unit interval to ensure that no extreme values are present in the dataset. For the scaling action, we adopt the min–max normalization technique. Then, following [5], we create new instances of the minority class, while, in parallel, we reduce the instances of the majority class (i.e. C̄) by implementing oversampling techniques (see the next subsection for more details). In general, oversampling focuses on the enhancement of the minority class (i.e. C) to adjust the class distribution of our dataset, thus achieving better performance. The newly created dataset is adopted to train the autoencoder. For the experiments regarding the simple autoencoder, we adopt three (3) hidden layers and the exponential linear unit (ELU) as the activation function, and we gradually reach ten (10) nodes (features) from the initial 30 ones of the dataset. We consider ELU as it performed better than other activation functions in our experiments. The ELU activation function tends to converge the cost to zero more quickly than other functions (it can produce negative values, allowing the network to push the mean activation closer to zero), while being capable of producing more accurate results. In general, the ELU activation function decreases the gap between the normal gradient and the unit natural gradient, and thereby it achieves a speed-up of the learning process [23].
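For reference, the ELU activation with hyperparameter α > 0 (a standard definition, not reproduced from [23]) is:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$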
For the experiments that use a variational autoencoder, we use two (2) hidden layers and we gradually reach ten (10) nodes (features) from the initial 30 ones of the dataset. As the activation function, we adopt tanh in every layer, relying on a set of simulations to reveal the performance of multiple activation functions and choose the best one. Finally, we have to notice that the model loss is calculated with the binary cross-entropy loss function. After the end of the training process, we get the new feature-reduced dataset from the hidden layer and eliminate redundancies in the feature representation space. For instance, in a dataset with thirty (30) features, we can conclude in only ten (10), significantly reducing the data space upon which we deliver the final classification outcome. The above processing increases the speed of processing, with a positive impact on our model, as we target to support a streaming environment where numerous transactions are collected. After the discussed step, we create a new encoded (and reduced) feature dataset which is fed to the CNN, which has an input layer, two (2) hidden layers and an output layer. In each layer, we implement a pooling layer, batch normalization and the dropout method to avoid overfitting. In the first two (2) layers, we adopt the ReLU activation function, and for the output layer the activation is performed by a sigmoid function. Our decision for adopting the specific activation functions was concluded through an extensive experimentation that revealed their performance for the specific problem. The model loss is calculated with the binary cross-entropy loss function (cross-entropy minimization is frequently used in optimization and rare event probability estimation). The CNN is evaluated on a held-out test set to measure its performance. Code and data for replicating our experiments are available at https://github.com/zioviris/Credit_Card.
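As an illustration of the pipeline just described, the following Keras sketch wires a 30-to-10 autoencoder to a small convolutional classifier. Apart from the stated 30 input features, 10 bottleneck units, ELU/ReLU/sigmoid activations and binary cross-entropy loss, all layer sizes are assumptions, not the authors' exact configuration (which is available in the repository above):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Autoencoder: three hidden layers with ELU and a 10-unit bottleneck.
inp = keras.Input(shape=(30,))
h = layers.Dense(20, activation="elu")(inp)
code = layers.Dense(10, activation="elu")(h)          # latent representation
dec = layers.Dense(20, activation="elu")(code)
out = layers.Dense(30, activation="sigmoid")(dec)     # features scaled to [0, 1]
autoencoder = keras.Model(inp, out)
encoder = keras.Model(inp, code)                      # feeds the CNN after training
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# CNN classifier on the 10 encoded features, treated as a one-dimensional signal.
clf = keras.Sequential([
    layers.Input(shape=(10, 1)),
    layers.Conv1D(32, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(2),
    layers.Dropout(0.2),
    layers.Conv1D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),            # fraud probability
])
clf.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=[keras.metrics.Recall(), keras.metrics.Precision()])
```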

4.2 Oversampling for the minority class

An unbalanced dataset is a common problem in DML/ML. The reason is that training a DML/ML algorithm with such a dataset often results in a particular bias towards the majority class. To tackle the problem of imbalance in the training dataset, the authors of [5] have introduced SMOTE, which is one of the most popular oversampling techniques. SMOTE is based on kNNs using the Euclidean distance between data points in the feature space. For every minority instance, i.e. an instance that belongs to the minority class C, its k nearest neighbours belonging to the same class are detected, and C is oversampled. We take each sample in C and introduce synthetic instances along the line segments joining any/all of the k minority class nearest neighbours [5]. Depending on the required number of oversamples, instances from the k nearest neighbours are randomly chosen. We incorporate into our decision making multiple oversampling techniques to reveal their performance when combined with the proposed autoencoder and the CNN. Hence, apart from the SMOTE technique, we also study the adoption of additional schemes for the management of imbalanced datasets. Another method for oversampling is the k-Means SMOTE technique. This technique avoids the generation of noise and effectively overcomes imbalances between and within classes by employing the k-means clustering algorithm in combination with SMOTE oversampling [8]. In the Borderline SMOTE technique, for every minority instance, its k nearest neighbours of the same class are extracted and some of them are randomly selected based on the oversampling rate [6]. In the SVM SMOTE technique, the method first preprocesses the data by oversampling the minority instances in the feature space; then, the pre-images of the synthetic samples are found based on a distance relation between the feature space and the input space. Finally, these pre-images are appended to the original dataset [7]. The last technique that is adopted in this paper is ADASYN [9]. ADASYN is based on the idea of using a weighted distribution for the instances of C according to their level of difficulty in learning: more synthetic data are generated for instances of C that are harder to learn than for those that are easier to learn. ADASYN improves the learning ability w.r.t. data distributions and reduces the biases introduced by the class imbalance problem. The final target is to adaptively shift the classification decision boundary towards the space of the 'difficult' instances.
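A hedged sketch of how the five techniques can be applied with the imbalanced-learn package; `X_train` and `y_train` are assumed to be the preprocessed training split:

```python
from imblearn.over_sampling import (
    SMOTE, BorderlineSMOTE, SVMSMOTE, KMeansSMOTE, ADASYN,
)

# One sampler per technique discussed above; all share the fit_resample API.
samplers = {
    "SMOTE": SMOTE(random_state=42),
    "Borderline SMOTE": BorderlineSMOTE(random_state=42),
    "SVM SMOTE": SVMSMOTE(random_state=42),
    "k-Means SMOTE": KMeansSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}

# Produce one rebalanced training set per technique for the comparative study.
resampled = {
    name: sampler.fit_resample(X_train, y_train)
    for name, sampler in samplers.items()
}
```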

5 Theory

5.1 Autoencoder for feature selection

Autoencoders are NNs that are traditionally used for dimensionality reduction, targeting data representations that may improve the performance while consuming less memory and run time. Autoencoders are trained to be capable of reproducing their inputs at their outputs. They adopt a hidden layer h(x′) trained to depict the provided input. An autoencoder may be viewed as a model containing two parts: (i) the encoder function f(x) (x is the input into the autoencoder) and (ii) a decoder scheme g(h) that produces a reconstruction of the initial input. The following equations hold true:

f(x) → h(x′)    (1)

h(x′) → g(h) = x′    (2)

f, g = arg min_{f,g} ‖x − (g ∘ f)(x)‖²    (3)

The encoder function, denoted by f(x), maps the original data x to a latent space h(x′) which is present at the hidden layer, before the data are reproduced by the decoder. The decoder function, denoted by g(h), maps the latent space h(x′) at the hidden layer to the output, which should be the same as the input. The discussed encoding network can be represented by a standard NN function transferred through an activation function, where l is the latent dimension (W and b are the weights and biases of the layer):

l = σ(Wx + b)    (4)


Similarly, the decoder can be depicted in the same way, however with different weights, biases and potentially activation functions. The decoding phase can be represented by the following equation (W′ and b′ are the weights and biases of the hidden layer):

x′ = σ′(W′l + b′)    (5)

In the proposed autoencoder, we adopt a loss function written in terms of the aforementioned functions. The loss function is utilized to train the NN through the backpropagation algorithm. Through backpropagation, our model continuously seeks to limit the error between the calculated values and the target outputs. This forces the hidden layer to perform dimensionality reduction and eliminate noise while reconstructing the inputs, especially when the number of neurons in the hidden layer is low. The following equation holds true:

L(x, x′) = ‖x − x′‖² = ‖x − σ′(W′(σ(Wx + b)) + b′)‖²    (6)
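A minimal numeric sketch of Eqs. (4)-(6); the shapes and the choice of tanh as the activation σ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, latent = 30, 10                        # 30 input features, 10 latent features
W, b = rng.normal(scale=0.1, size=(latent, M)), np.zeros(latent)   # encoder parameters
Wp, bp = rng.normal(scale=0.1, size=(M, latent)), np.zeros(M)      # decoder parameters
sigma = np.tanh                           # activation; the experiments use ELU/tanh

def encode(x):                            # Eq. (4): l = sigma(W x + b)
    return sigma(W @ x + b)

def decode(l):                            # Eq. (5): x' = sigma'(W' l + b')
    return sigma(Wp @ l + bp)

def reconstruction_loss(x):               # Eq. (6): ||x - x'||^2
    return float(np.sum((x - decode(encode(x))) ** 2))
```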

5.2 A variational autoencoder for feature selection

Unlike typical autoencoders, variational autoencoders (VAEs) are generative models that exhibit different mathematical formulations compared with autoencoders. VAEs focus on probabilistic graphical models with posterior probabilities being approximated by a NN, thus formulating the architecture of an autoencoder. VAEs try to emulate how data are generated in order to reveal the underlying causal relations. This approach differs from discriminative models that aim to learn a predictor given specific observations. VAEs rely on strong assumptions for the distribution of latent features using a variational approach. This approach results in additional loss components and a specific estimator for training purposes, i.e. the stochastic gradient variational Bayes (SGVB) estimator. The assumption is that data are generated by a directed graphical model p_θ(x|h) and that the encoder learns an approximation q_ϕ(h|x) to the posterior distribution p_θ(h|x), where ϕ and θ denote the parameters of the encoder and decoder, respectively. The probability distribution of the latent vector of a VAE typically matches that of the training data much more closely than that of a standard autoencoder. The objective of the VAE has the following form:

L(ϕ, θ, x) = D_KL(q_ϕ(h|x) ‖ p_θ(h)) − E_{q_ϕ(h|x)}[log p_θ(x|h)]    (7)

In the above equation, D_KL depicts the Kullback-Leibler divergence. The prior over the latent features is usually set to be the centred isotropic multivariate Gaussian, i.e.

p_θ(h) = N(0, I)    (8)

Commonly, the shapes of the variational and the likelihood distributions are chosen such that they are factorized Gaussian distributions:

q_ϕ(h|x) = N(ρ(x), ω²(x)I)    (9)

p_θ(x|h) = N(μ(h), σ²(h)I)    (10)

where ρ(x) and ω²(x) are the encoder outputs, while μ(h) and σ²(h) are the decoder outputs. These formulations are justified by the rationale of simplifying the evaluation of both the KL divergence and the likelihood term in the variational objective defined above.
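To make Eq. (7) concrete, the sketch below evaluates the closed-form KL term for the Gaussian choices of Eqs. (8)-(9) and estimates the likelihood term with one reparameterized sample; `decoder_log_likelihood` is a hypothetical placeholder for log p_θ(x|h):

```python
import numpy as np

def kl_to_standard_normal(rho, omega2):
    # Closed-form KL( N(rho, diag(omega2)) || N(0, I) ), per Eqs. (8)-(9).
    return 0.5 * np.sum(omega2 + rho**2 - 1.0 - np.log(omega2))

def vae_loss(x, rho, omega2, decoder_log_likelihood, rng=np.random.default_rng(0)):
    # One-sample Monte Carlo estimate of Eq. (7) via the reparameterization trick.
    eps = rng.standard_normal(rho.shape)
    h = rho + np.sqrt(omega2) * eps        # h ~ q_phi(h | x)
    return kl_to_standard_normal(rho, omega2) - decoder_log_likelihood(x, h)
```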

5.3 The proposed convolutional neural network

CNNs were first proposed in [24] as the technology adopted to manage data as a 'picture' in the two-dimensional space. CNNs have been applied in various domains like image recognition, object detection, classification, video recognition, recommender systems, image classification, medical image analysis, natural language processing (NLP) and financial time series. A CNN employs a special operation named convolution, which is a specialized type of linear processing used instead of a generic matrix multiplication in at least one of its layers. In a CNN, hidden layers involve a set of convolutional operations based on multiplication, other dot products or cross-correlation calculations. This is significant for the indices within the data matrix, as it affects how weights are set at a selected index point⁷. ReLU is a widely adopted activation function, followed by additional layers like pooling layers, fully connected layers and normalization layers. In general, convolution is an operation that delivers a smoothed estimate of the input function x(t):

s(t) = ∫ x(a) w(t − a) da    (11)

where w(a) is the kernel function in the form of a valid probability density function and s(t) is the output. The convolutional operation is usually depicted by an asterisk:

s(t) = (x ∗ w)(t)    (12)

Alternatively, we can define the discrete convolution as the following equation depicts:

s(t) = (x ∗ w)(t) = Σ_a x(a) w(t − a)    (13)

Usually, we use convolutions over multiple axes at a specific time. So, the above functions can be expressed as:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)    (14)


where K and I are the used kernel and the provided input, respectively. Equivalently, we can exploit the commutativity of the convolution operation:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)    (15)

The motivation behind the use of such a technique is that convolution leverages three important ideas that can help improve a machine learning system, i.e. (i) sparse interactions; (ii) parameter sharing; and (iii) equivariant representations⁸.

⁷ Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
⁸ Ibid.
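A direct NumPy rendering of the discrete two-dimensional convolution of Eq. (14) ('valid' extent, single channel); deep learning frameworks implement the unflipped cross-correlation variant far more efficiently:

```python
import numpy as np

def conv2d(I, K):
    # 'Valid' extent: output size is (H - kh + 1) x (W - kw + 1).
    kh, kw = K.shape
    H, W = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    S = np.zeros((H, W))
    Kf = np.flip(K)   # flipping the kernel turns cross-correlation into Eq. (14)
    for i in range(H):
        for j in range(W):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return S
```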

6 Results and discussion

6.1 Experimental set-up and performance metrics

We report on the evaluation of the proposed model upon a real dataset. Our dataset contains credit card transactions made in September 2013 by European cardholders, collected during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of Université Libre de Bruxelles on big data mining and fraud detection [15, 16, 25–29]. The adopted dataset depicts transactions that occurred over two (2) days, where 492 fraudulent transactions out of 284,807 are present. The dataset is highly unbalanced: the C class (frauds) accounts for 0.172% of all the available transactions. It contains only numerical features, which are fed into our encoder. The feature 'Class' in the dataset is the classification outcome, which is realized equal to unity (C = 1) in the case of a fraud and zero otherwise. In our experimentation, we perform feature scaling applying the min–max normalization (i.e. we subtract the minimum of every feature, then divide by the range of the feature).
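A small sketch of the described min–max scaling; fitting the scaler on the training split only is an added precaution the text does not spell out:

```python
import numpy as np

def min_max_scale(X_train, X_test):
    # Subtract each feature's minimum and divide by its range, per the text.
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    return (X_train - lo) / span, (X_test - lo) / span
```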
Six performance metrics are adopted to evaluate our model, i.e. accuracy (α), precision (ε), recall (ζ), the Matthews correlation coefficient or MCC (μ), the F1-score (δ) and the area under the curve (AUC). ε is the fraction of true events (i.e. frauds) amongst all samples which are classified as frauds, while ζ is the fraction of frauds which have been classified correctly over the total amount of frauds. δ is a performance metric that combines both ε and ζ. The MCC is a machine learning measure which is used to check the balance of binary (two-class) classifiers. It takes into account all the true and false values, which is why it is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. AUC provides an aggregate measure of performance across all possible classification thresholds. The area under the curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). It is also common to calculate the area under the ROC convex hull, as any point on the line segment between two prediction results can be achieved by randomly using one or the other system with probabilities proportional to the relative length of the opposite component of the segment.


Table 1  Three-fold cross-validation experiments


Models 𝜁 (%) 𝜖 (%) 𝛿 (%) 𝛼 (%) 𝜇 (%) AUC (%)

CNN 77.20 90.50 83.90 99.97 83.60 97.10


SVM 77.43 85.81 81.41 99.94 81.49 96.90
AE-CNN 76.42 84.30 80.17 99.93 80.23 96.80
SMOTE-AE-CNN 73.78 88.32 80.39 99.94 80.69 95.80
Borderline SMOTE-AE-CNN 76.29 84.45 80.12 99.93 80.20 95.20
SVM SMOTE-AE-CNN 76.42 87.44 81.56 99.94 81.72 96.90
K-Means SMOTE-AE-CNN 77.44 86.00 81.50 99.94 81.58 95.70
ADASYN-AE-CNN 77.03 87.33 81.86 99.94 81.99 97.20
VAE-CNN 77.84 82.90 80.29 99.93 80.30 95.50
SMOTE-VAE-CNN 87.80 6.62 12.31 97.84 23.77 96.70
Borderline SMOTE-VAE-CNN 88.01 7.85 14.41 98.19 25.98 95.90
SVM SMOTE-VAE-CNN 87.60 9.84 17.70 98.59 29.10 94.80
K-Means SMOTE-VAE-CNN 88.01 5.62 10.56 97.43 21.87 95.80
ADASYN-VAE-CNN 93.50 1.29 2.54 87.59 10.16 96.60

The following equations hold true:

α = (TP + TN) / (TP + TN + FP + FN)    (16)

ε = TP / (TP + FP)    (17)

ζ = TP / (TP + FN)    (18)

μ = ((TP ⋅ TN) − (FP ⋅ FN)) / √((TP + FP) ⋅ (TP + FN) ⋅ (TN + FP) ⋅ (TN + FN))    (19)

δ = 2 ⋅ (ε ⋅ ζ) / (ε + ζ)    (20)
In the above equations, TP (true positive) is the number of frauds which have been
classified correctly. FP (false positive) is the number of normal transactions which
have been classified as frauds. FN (false negative) is the number of frauds which
have been classified as normal ones. TN (true negative) is the number of normal
transactions that have been classified as normal.
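For completeness, Eqs. (16)-(20) translate directly into code; AUC is omitted here since it is computed from ranking scores (e.g. scikit-learn's roc_auc_score) rather than from the confusion matrix:

```python
import math

def confusion_metrics(TP, TN, FP, FN):
    alpha = (TP + TN) / (TP + TN + FP + FN)            # accuracy, Eq. (16)
    epsilon = TP / (TP + FP)                           # precision, Eq. (17)
    zeta = TP / (TP + FN)                              # recall, Eq. (18)
    mu = (TP * TN - FP * FN) / math.sqrt(              # MCC, Eq. (19)
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    delta = 2 * epsilon * zeta / (epsilon + zeta)      # F1-score, Eq. (20)
    return alpha, epsilon, zeta, mu, delta
```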


Table 2  Four-fold cross-validation experiments


Models 𝜁 (%) 𝜖 (%) 𝛿 (%) 𝛼 (%) 𝜇 (%) AUC (%)

CNN 78.60 91.20 84.49 99.98 83.49 97.80


SVM 77.43 85.42 81.23 99.94 81.61 96.80
AE-CNN 79.27 86.28 82.63 99.94 82.67 97.10
SMOTE-AE-CNN 76.80 87.20 81.70 99.94 81.87 97.70
Borderline SMOTE-AE-CNN 76.42 86.04 80.94 99.94 81.06 96.40
SVM SMOTE-AE-CNN 75.81 85.55 80.38 99.94 80.50 96.90
K-Means SMOTE-AE-CNN 74.19 85.68 79.52 99.93 79.69 98.20
ADASYN-AE-CNN 76.22 86.41 80.99 99.94 81.12 96.00
VAE-CNN 76.83 82.90 79.75 99.93 79.77 92.80
SMOTE-VAE-CNN 87.20 5.11 9.66 97.18 20.73 94.70
Borderline SMOTE-VAE-CNN 87.40 6.87 12.74 97.93 24.17 94.60
SVM SMOTE-VAE-CNN 87.40 7.56 13.91 98.13 25.39 95.40
K-Means SMOTE-VAE-CNN 87.80 4.97 9.40 97.08 20.49 96.70
ADASYN-VAE-CNN 93.70 1.30 2.56 87.67 10.22 96.60

Table 3  Five-fold cross-validation experiments


Models 𝜁 (%) 𝜖 (%) 𝛿 (%) 𝛼 (%) 𝜇 (%) AUC (%)

CNN 78.20 91.01 84.15 99.99 84.50 97.52


SVM 77.64 85.65 81.44 99.93 81.50 96.10
AE-CNN 77.80 86.06 81.70 99.94 81.82 96.30
SMOTE-AE-CNN 76.40 89.30 79.07 99.94 82.59 98.00
Borderline SMOTE-AE-CNN 76.42 87.23 81.47 99.94 81.62 95.50
SVM SMOTE-AE-CNN 77.03 88.76 82.48 99.94 82.66 95.90
K-Means SMOTE-AE-CNN 74.59 85.95 79.87 99.94 80.04 93.30
ADASYN-AE-CNN 76.63 87.88 81.87 99.94 82.03 95.60
VAE-CNN 78.25 82.09 80.12 99.93 80.11 95.90
SMOTE-VAE-CNN 88.41 6.20 11.59 97.67 23.07 94.70
Borderline SMOTE-VAE-CNN 88.62 7.03 13.03 97.96 24.64 94.50
SVM SMOTE-VAE-CNN 86.99 10.18 18.22 98.65 29.49 95.80
K-Means SMOTE-VAE-CNN 86.99 7.76 14.25 98.19 25.67 95.10
ADASYN-VAE-CNN 93.90 1.19 2.36 86.55 9.75 96.90

6.2 Performance assessment

Our set of experiments involves the implementation of a stratified m-fold cross-validation technique to ensure that the separations of the adopted dataset are relatively unbiased. In m-fold cross-validation, we define the number m as the number of partitions into which the dataset is separated. m − 1 partitions are used for training, while the last one is used for testing. The above process is repeated m times, with a different partition used for testing each time. In our experiments, we use m ∈ {3, 4, 5, 10}. Our results are shown in Tables 1, 2, 3 and 4, respectively.
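A sketch of the stratified m-fold protocol; `build_and_evaluate` is a hypothetical stand-in for training the chosen pipeline on the training folds and scoring it on the held-out fold, and `X`, `y` denote the features and fraud labels:

```python
from sklearn.model_selection import StratifiedKFold

for m in (3, 4, 5, 10):
    skf = StratifiedKFold(n_splits=m, shuffle=True, random_state=42)
    for train_idx, test_idx in skf.split(X, y):
        scores = build_and_evaluate(X[train_idx], y[train_idx],
                                    X[test_idx], y[test_idx])
```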


Table 4  10-fold cross-validation experiments

Models 𝜁 (%) 𝜖 (%) 𝛿 (%) 𝛼 (%) 𝜇 (%) AUC (%)

CNN 77.65% 91.27% 84.47% 99.93% 83.70% 93.20%


SVM 77.62% 85.45% 81.36% 99.93% 81.60% 92.80%
AE-CNN 77.43% 86.78% 81.84% 99.94% 81.95% 94.00%
SMOTE-AE-CNN 77.84% 87.24% 82.27% 99.94% 82.38% 99.50%
Borderline SMOTE-AE-CNN 75.60% 87.74% 81.22% 99.94% 81.42% 95.80%
SVM SMOTE-AE-CNN 78.25% 87.30% 82.53% 99.94% 82.62% 99.90%
K-Means SMOTE-AE-CNN 75.81% 86.95% 81.00% 99.94% 81.16% 98.10%
ADASYN-AE-CNN 77.23% 87.56% 82.07% 99.94% 82.21% 96.50%
VAE-CNN 78.86% 83.44% 81.09% 99.94% 81.09% 93.20%
SMOTE-VAE-CNN 86.99% 8.14% 14.89% 98.28% 26.32% 94.80%
Borderline SMOTE-VAE-CNN 88.21% 6.95% 12.88% 97.94% 24.43% 96.70%
SVM SMOTE-VAE-CNN 87.60% 9.66% 17.39% 98.56% 28.81% 94.00%
K-Means SMOTE-VAE-CNN 88.12% 6.12% 11.45% 97.64% 22.89% 92.50%
ADASYN-VAE-CNN 94.11% 1.20% 2.37% 86.58% 9.78% 97.30%

In the following figures, we present our experimental evaluation outcomes for combinations of the main aforementioned models, i.e. AE-CNN and VAE-CNN. In this set of experiments, we consider m = 4. In Fig. 3, we observe the performance of the AE-CNN core for m = 4, and in Fig. 4 the performance of the VAE-CNN core for m = 4.

6.3 Results

When we adopt m = 3 fold cross-validation, the best model regarding ζ (i.e. recall, which represents the number of fraudulent transactions classified correctly) is the ADASYN-VAE-CNN combination (Table 1, ζ = 93.50%). Nevertheless, in terms of ε, δ and μ, the discussed model exhibits the worst performance. Recall that ε is degraded by normal transactions classified as fraudulent. In absolute terms, 1,300 normal transactions are classified as fraudulent out of a number of 80,000 transactions in total. The best model, in terms of ε, α, μ and δ, is the scheme adopting a single CNN (Table 1, ε = 90.50%, α = 99.97%, μ = 83.60% and δ = 83.90%). However, the second best model in terms of ζ is the K-means SMOTE-AE-CNN model (Table 1, ζ = 77.44%). The important thing about this model is that it performed adequately in terms of ε, δ and μ, in contrast to the aforementioned ADASYN-VAE-CNN.

When m = 4, the best model regarding ζ is the ADASYN-VAE-CNN model (Table 2, ζ = 93.70%). However, in terms of ε and δ it exhibits the worst performance. The best model in terms of ε, α, μ and δ is the CNN model (Table 2, ε = 91.20%, α = 99.98%, μ = 83.49% and δ = 84.49%). An interesting result is that the AE-CNN model has the second best performance in terms of ζ (Table 2, ζ = 79.27%), outperforming the remaining models except the aforementioned ADASYN-VAE-CNN model. The important thing about this model is that it performed adequately in terms of ε, δ and μ, in contrast to the aforementioned ADASYN-VAE-CNN.

Fig. 3  Performance of AE-CNN for m = 4 (top: AE-CNN model performance; bottom: AE-CNN ROC curve)
With m = 5, the best model regarding ζ is the ADASYN-VAE-CNN combination (Table 3, ζ = 93.90%). The best model in terms of ε, μ, α and δ is the individual CNN (Table 3, ε = 91.01%, α = 99.99%, μ = 84.50%, δ = 84.15%). Interestingly, the VAE-CNN combination outperforms the remaining models in terms of ζ (Table 3, ζ = 78.25%) while, at the same time, it performed adequately in terms of ε, δ and μ, in contrast to the aforementioned ADASYN-VAE-CNN.

With m = 10, the best performing model regarding ζ is the ADASYN-VAE-CNN combination (Table 4, ζ = 94.11%). In terms of ε, μ and δ, the CNN model has the best performance (Table 4, ε = 91.27%, μ = 83.70% and δ = 84.47%). An interesting result is that the VAE-CNN model has the second best performance in terms of recall, i.e. ζ = 78.86%, outperforming the remaining models (except the ADASYN-VAE-CNN models as mentioned above) while, at the same time, performing adequately in terms of ε, δ and μ, in contrast to the aforementioned ADASYN-VAE-CNN.

Fig. 4  Performance of VAE-CNN for m = 4 (top: VAE-CNN model performance; bottom: VAE-CNN ROC curve)

6.4 Discussion

In this paper, the proposed method tries to perform better than conventional ML methods in terms of recall, the metric that shows which of the transactions have been classified correctly as fraud over the total amount of frauds. Keeping that in mind, the ADASYN-VAE-CNN model outperforms the rest of the models by far (average recall of approximately 93.80%). While recall is important, we cannot overlook the poor performance of the model in terms of precision (average precision of approximately 1.25%), which of course leads to a poor performance in terms of F1-score and MCC. Even then, in absolute terms, 1,300 normal transactions were classified as fraudulent out of 80,000 transactions in total, which is approximately 1.6% of the total transactions. A financial institution can easily use the aforementioned model during regulatory and external audit controls, or during periods when credit fraud is higher than normal due to external reasons (Tables 5, 6, 7, 8).


Table 5  Comparison between our models' results and the results from [1], i.e. AE and RBM

Models AUC (%)

AE-CNN 94.00
SMOTE-AE-CNN 99.50
Borderline SMOTE-AE-CNN 95.80
SVM SMOTE-AE-CNN 99.90
ADASYN-AE-CNN 96.50
K-Means SMOTE-AE-CNN 98.10
VAE-CNN 93.20
Borderline SMOTE-VAE-CNN 96.70
SVM SMOTE-VAE-CNN 94.00
ADASYN-VAE-CNN 97.30
K-Means SMOTE-VAE-CNN 92.50
SMOTE-VAE-CNN 94.80
AE ([1]) 96.03
RBM ([1]) 95.05

Table 6  Results of [22] without the use of AdaBoost

Models 𝛼 (%) 𝜇 (%)
AE-CNN 99.94 82.67
SMOTE-AE-CNN 99.94 81.87
Borderline SMOTE-AE-CNN 99.94 81.06
SVM SMOTE-AE-CNN 99.94 80.50
ADASYN-AE-CNN 99.94 81.12
K-Means SMOTE-AE-CNN 99.93 79.69
VAE-CNN 99.93 79.77
Borderline SMOTE-VAE-CNN 97.93 24.17
SVM SMOTE-VAE-CNN 98.13 25.39
ADASYN-VAE-CNN 87.67 10.22
K-Means SMOTE-VAE-CNN 97.08 20.49
SMOTE-VAE-CNN 97.18 20.78
Naive Bayes ([22]) 97.71 21.90
Decision Tree ([22]) 99.92 77.50
Random Forest ([22]) 99.89 60.40
Gradient Boosted Tree ([22]) 99.90 74.60
Decision Stump ([22]) 99.90 71.10
Random Tree ([22]) 99.87 49.70
Deep Learning (MLP) ([22]) 99.92 78.70
Neural Network ([22]) 99.94 81.20
MLP ([22]) 99.93 80.60
Linear Regression ([22]) 99.91 68.30
Logistic Regression ([22]) 99.93 78.60
SVM ([22]) 99.94 81.30


Table 7  Results of [22] with the use of AdaBoost

Models 𝛼 (%) 𝜇 (%)
AE-CNN 99.94 82.67
SMOTE-AE-CNN 99.94 81.87
Borderline SMOTE-AE-CNN 99.94 81.06
SVM SMOTE-AE-CNN 99.94 80.50
ADASYN-AE-CNN 99.94 81.12
K-Means SMOTE-AE-CNN 99.93 79.69
VAE-CNN 99.93 79.77
Borderline SMOTE-VAE-CNN 97.93 24.17
SVM SMOTE-VAE-CNN 98.13 25.39
ADASYN-VAE-CNN 87.67 10.22
K-Means SMOTE-VAE-CNN 97.08 20.49
SMOTE-VAE-CNN 97.18 20.78
(AdaBoost)Naive Bayes ([22]) 98.04 23.50
(AdaBoost)Decision Tree ([22]) 99.92 77.50
(AdaBoost)Random Forest ([22]) 99.89 60.40
(AdaBoost)Gradient Boosted Tree ([22]) 99.90 74.70
(AdaBoost)Decision Stump ([22]) 99.91 71.10
(AdaBoost)Random Tree ([22]) 99.87 49.70
(AdaBoost)Deep Learning (MLP) ([22]) 99.92 76.50
(AdaBoost)Neural Network ([22]) 99.93 80.70
(AdaBoost)MLP ([22]) 99.93 80.60
(AdaBoost)Linear Regression ([22]) 99.91 68.60
(AdaBoost)Logistic Regression ([22]) 99.93 78.60
(AdaBoost)SVM ([22]) 99.93 79.60

Nevertheless, in this paper, we also propose a set of models that have an AE-CNN or VAE-CNN core, which perform excellently in terms of recall, while at the same time their performance in terms of precision is decent. In particular, for m = 3 folds, the best model which meets the above expectations is the K-means SMOTE-AE-CNN model. For m = 4, the best model that meets the above expectations is the AE-CNN model. For m = 5 and m = 10, the best model that meets the above expectations is the VAE-CNN model. Even then, their performance in terms of recall is better than the performance of the conventional cornerstone ML methods.


Table 8  Results of majority voting in [22]

Models 𝛼 (%) 𝜇 (%)
AE-CNN 99.94 82.67
SMOTE-AE-CNN 99.94 81.87
Borderline SMOTE-AE-CNN 99.94 81.06
SVM SMOTE-AE-CNN 99.94 80.50
ADASYN-AE-CNN 99.94 81.12
K-Means SMOTE-AE-CNN 99.93 79.69
VAE-CNN 99.93 79.77
Borderline SMOTE-VAE-CNN 97.93 24.17
SVM SMOTE-VAE-CNN 98.13 25.39
ADASYN-VAE-CNN 87.67 10.22
K-Means SMOTE-VAE-CNN 97.08 20.49
SMOTE-VAE-CNN 97.18 20.78
Decision Stump + Gradient Boosted Tree ([22]) 99.85 34.30
Decision Tree + Decision Stump ([22]) 99.85 36.10
Decision Tree + Gradient Boosted Tree ([22]) 99.92 73.70
Decision Tree + Naive Bayes ([22]) 99.93 78.80
Naive Bayes + Gradient Boosted Tree ([22]) 99.92 74.20
Neural Network + Naive Bayes ([22]) 99.94 82.30
Random Forest + Gradient Boosted Tree ([22]) 99.87 46.80

7 Comparative assessment

In this section, we compare our models with the results that the authors of [1] and [22] present in their work; since the same dataset is used, a direct comparison is straightforward. In the first paper, the authors use a deep autoencoder and an RBM model in order to outperform other techniques. In that study, the authors use the hyperbolic tangent ('Tanh') function to encode and decode the input to the output. Once the AE model has been trained, the reconstruction error is minimized through backpropagation: the 'error signal', defined as the difference between the actual and desired output values, is computed at the output units and propagated backwards through the network, with the parameter gradients of the AE driving the updates. Their architecture uses hidden layers with three encoders and three decoders, each hidden layer applying the 'Tanh' activation function, and the data are split into 80% training and 20% test sets, with normal transactions used to predict fraudulent ones. The other algorithm that the authors use is the RBM, which consists of two structures: a visible (input) layer and a hidden layer. Each input node takes an input feature from the dataset to be learned. The design differs from other deep learning architectures in that there is no output layer; the output of the RBM is the reconstruction fed back to the input, and the key property of RBMs is that they learn to reconstruct the data on their own. The only metric used for comparison is the AUC score.
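To illustrate the kind of 'Tanh' autoencoder described in [1], a minimal sketch is given below. The layer widths (20/10/5) are illustrative assumptions, not values reported in [1], and X_normal is a random placeholder for the matrix of normal (non-fraud) transactions.

# Sketch of a three-encoder/three-decoder 'Tanh' autoencoder with an
# 80/20 split, in the spirit of [1]; layer sizes are assumed.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

def build_tanh_autoencoder(n_features):
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(20, activation="tanh")(inputs)  # encoder layer 1
    x = layers.Dense(10, activation="tanh")(x)       # encoder layer 2
    x = layers.Dense(5, activation="tanh")(x)        # encoder layer 3 (bottleneck)
    x = layers.Dense(10, activation="tanh")(x)       # decoder layer 1
    x = layers.Dense(20, activation="tanh")(x)       # decoder layer 2
    outputs = layers.Dense(n_features, activation="linear")(x)  # decoder output
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")      # reconstruction error via backprop
    return model

X_normal = np.random.rand(1000, 30).astype("float32")  # placeholder data
X_train, X_test = train_test_split(X_normal, test_size=0.2, random_state=42)
ae = build_tanh_autoencoder(X_train.shape[1])
ae.fit(X_train, X_train, epochs=20, batch_size=64, validation_data=(X_test, X_test))
# A high reconstruction error on a new transaction then flags a potential fraud.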


In [22], the authors use various machine learning algorithms, including Naive Bayes, SVM, Decision Tree, Random Forest, Gradient Boosted Tree, Decision Stump and MLP, with and without AdaBoost and majority voting, with the aim of detecting fraudulent transactions. The Matthews Correlation Coefficient (MCC) and accuracy are used for comparison. Neither of these papers provides sufficient information about precision, recall or F1 score, nor do they report the use of K-fold cross-validation. In every comparison, the metrics of our models are those of the number of folds that gives the best results, following the previous tables.
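For reference, both comparison metrics, as well as the precision, recall and F1 scores missing from those papers, can be computed directly with scikit-learn. The arrays below are toy placeholders, not results from our dataset.

# Computing accuracy (𝛼), MCC (𝜇), AUC and precision/recall/F1 with sklearn.
import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef, roc_auc_score,
                             precision_score, recall_score, f1_score)

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])          # toy test labels
y_score = np.array([0.1, 0.2, 0.05, 0.9, 0.3, 0.4,
                    0.15, 0.1, 0.8, 0.25])                  # toy fraud probabilities
y_pred = (y_score > 0.5).astype(int)                        # thresholded predictions

print("accuracy (alpha):", accuracy_score(y_true, y_pred))
print("MCC (mu):", matthews_corrcoef(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
print("precision/recall/F1:", precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))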

7.1 Comparison between our models’ results and the results from [1]

From this comparison, we can see that the majority of our models perform better than those of the authors of [1]. The best of our models achieves an AUC score of 99.90%, while the best model in [1] reaches 96.03%. Once again, we note that this paper did not provide sufficient information about precision, recall, F1 score or any other metric besides the AUC score.

7.2 Comparison between our models’ results and the results from [22]

From this comparison, we can see that the majority of our models with the deep autoencoder perform better than the models of the authors of [22]. Against the standalone classifiers of [22], the best of our models achieves a score of 99.94% in terms of accuracy (𝛼) and 82.67% in terms of MCC (𝜇), while the best standalone model in [22] reaches 99.94% in accuracy and 81.30% in MCC.
Against the AdaBoost variants of [22] (Table 7), the picture is similar: the best of our models again achieves 99.94% accuracy and 82.67% MCC, while the best AdaBoost model in [22] reaches 99.93% accuracy and 80.70% MCC.
Against the majority voting combinations of [22] (Table 8), the best of our models once more achieves 99.94% accuracy and 82.67% MCC, while the best majority voting model in [22] reaches 99.94% accuracy and 82.30% MCC. Once again, we note that [22] did not provide sufficient information about precision, recall, F1 score or any other metric.


8 Conclusions

In this study, we propose the combination of multiple deep learning technologies, namely autoencoders (AE and VAE) and convolutional neural networks (CNNs), to predict fraud cases in financial transactions. The discussed autoencoder is adopted for dimensionality reduction, while the CNN performs the final classification of each transaction (fraudulent or not). We adopt various oversampling techniques to deal with the limited number of positive-class samples. Our experiments show that the proposed combination of the autoencoder and the CNN performs better in terms of recall while, in parallel, precision and F1 score remain at acceptable levels. We also show that combining the ADASYN oversampling technique with a variational autoencoder and a CNN significantly increases recall, at the cost of poor precision and F1 score. We argue that the discussed combination can easily be used during regulatory and external audit controls, or during periods when credit fraud is higher than usual due to external factors. In addition, the results of our models are compared with those of recently published algorithms that use the same dataset; the majority of our models perform better than the ones proposed in those papers. One of our targets for future research is the incorporation of the temporal axis into our model and the study of the seasonality detected in fraudulent events.
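As an illustration of the oversampling step highlighted above, the following sketch applies ADASYN with the imbalanced-learn library. The training arrays are synthetic placeholders mimicking the roughly 1.6% fraud rate discussed earlier, not our actual data.

# ADASYN oversampling sketch using imbalanced-learn (placeholder data).
import numpy as np
from collections import Counter
from imblearn.over_sampling import ADASYN

rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 30))                               # feature matrix
y_train = np.r_[np.zeros(984, dtype=int), np.ones(16, dtype=int)]   # ~1.6% fraud

X_res, y_res = ADASYN(random_state=42).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))
# The rebalanced (X_res, y_res) would then feed the VAE-CNN stage.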

Author Contributions  All authors contributed to the study conception and design. All authors read and
approved the final manuscript.

Funding  The authors declare that no funds, grants, or other support were received during the preparation
of this manuscript.

Data Availability Statement  The datasets generated and analysed during the current study are available in the https://www.kaggle.com/kartik2112/fraud-detection-banksim/data repository.

Declarations 

Conflict of interest  The authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

Ethical approval  Authors confirm that the appropriate ethics review has been followed.

Informed Consent  Not applicable


References
1. Pumsirirat A, Yan L (2018) Credit card fraud detection using deep learning based on auto-encoder and restricted Boltzmann machine. Int J Adv Comput Sci Appl 9(1):18–25. https://doi.org/10.14569/IJACSA.2018.090103
2. Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(110):3371–3408
3. Valueva MV, Nagornov NN, Lyakhov PA, Valuev GV, Chervyakov NI (2020) Application of the residue number system to reduce hardware costs of the convolutional neural network implementation. Math Comput Simul 177:232–243. https://doi.org/10.1016/j.matcom.2020.04.031
4. Dupond S (2019) A thorough review on the current advance of neural network structures. Annu Rev Control 14:200–230
5. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357. https://doi.org/10.1613/jair.953
6. Han H, Wang WY, Mao BH (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Adv Intell Comput 3644:878–887. https://doi.org/10.1007/11538059_91
7. Zeng ZQ, Gao J (2009) Improving SVM classification with imbalance data set. In: Leung CS, Lee M, Chan JH (eds) Neural Information Processing. ICONIP 2009. Lecture Notes in Computer Science, vol 5863. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10677-4_44
8. Last F, Douzas G, Bação F (2017) Oversampling for imbalanced learning based on K-Means and SMOTE
9. He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the International Joint Conference on Neural Networks, pp 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
10. Prasad NR, Almanza-Garcia S, Lu TT (2009) Anomaly detection. Comput Mater Contin 14(1):1–22. https://doi.org/10.1145/1541880.1541882
11. Brown I, Mues C (2012) An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Syst Appl 39(3):3446–3453. https://doi.org/10.1016/j.eswa.2011.09.033
12. Bellotti T, Crook J (2009) Support vector machines for credit scoring and discovery of significant features. Expert Syst Appl 36(2):3302–3308. https://doi.org/10.1016/j.eswa.2008.01.005
13. Harris T (2013) Quantitative credit risk assessment using support vector machines: broad versus narrow default definitions. Expert Syst Appl 40(11):4404–4413. https://doi.org/10.1016/j.eswa.2013.01.044
14. Barboza F, Kimura H, Altman E (2017) Machine learning models and bankruptcy prediction. Expert Syst Appl 83:405–417. https://doi.org/10.1016/j.eswa.2017.04.006
15. Dal Pozzolo A, Caelen O, Le Borgne YA, Waterschoot S, Bontempi G (2014) Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst Appl 41(10):4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
16. Dal Pozzolo A, Boracchi G, Caelen O, Alippi C, Bontempi G (2018) Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Trans Neural Netw Learn Syst 29(8):3784–3797. https://doi.org/10.1109/TNNLS.2017.2736643
17. Fan Q, Yang J (2018) A denoising autoencoder approach for credit risk analysis. https://doi.org/10.1145/3194452.3194456
18. Chen J, Shen Y, Ali R (2019) Credit card fraud detection using sparse autoencoder and generative adversarial network. In: 2018 IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp 1054–1059. https://doi.org/10.1109/IEMCON.2018.8614815
19. Zhu B, Yang W, Wang H, Yuan Y (2018) A hybrid deep learning model for consumer credit scoring. In: 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), pp 205–208. https://doi.org/10.1109/ICAIBD.2018.8396195
20. Wang D et al (2019) A semi-supervised graph attentive network for financial fraud detection. In: 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, pp 598–607. https://doi.org/10.1109/ICDM.2019.00070


21. Kim A, Cho SB (2019) An ensemble semi-supervised learning method for predicting defaults in social lending. Eng Appl Artif Intell 81:193–199. https://doi.org/10.1016/j.engappai.2019.02.014
22. Randhawa K, Loo CK, Seera M, Lim CP, Nandi AK (2018) Credit card fraud detection using AdaBoost and majority voting. IEEE Access 6:14277–14284. https://doi.org/10.1109/ACCESS.2018.2806420
23. Clevert DA, Unterthiner T, Hochreiter S (2016) Fast and accurate deep network learning by exponential linear units (ELUs). In: 4th International Conference on Learning Representations (ICLR 2016), Conference Track Proceedings, pp 1–14
24. LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
25. Dal Pozzolo A (2015) Adaptive machine learning for credit card fraud detection. PhD thesis, Université Libre de Bruxelles. URL http://www.ulb.ac.be/di/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf
26. Dal Pozzolo A, Caelen O, Johnson RA, Bontempi G (2015) Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE Symposium Series on Computational Intelligence (SSCI), pp 159–166. https://doi.org/10.1109/SSCI.2015.33
27. Carcillo F, Dal Pozzolo A, Le Borgne YA, Caelen O, Mazzer Y, Bontempi G (2018) SCARFF: a scalable framework for streaming credit card fraud detection with spark. Inf Fusion 41:182–194. https://doi.org/10.1016/j.inffus.2017.09.005
28. Sperduti A, Navarin N, Oneto L (2020) Recent advances in big data and deep learning. Proceedings of the International Neural Networks Society. https://doi.org/10.1007/978-3-030-16841-4
29. Lebichot B, Le Borgne YA, He-Guelton L, Oblé F, Bontempi G (2020) In: Recent Advances in Big Data and Deep Learning, Proceedings of the International Neural Networks Society, pp 78–88. https://doi.org/10.1007/978-3-030-16841-4

Publisher’s Note  Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
