
International Journal of Computing and Digital Systems
ISSN (2210-142X)
Int. J. Com. Dig. Sys. 10, No.1 (Feb-2021)
http://dx.doi.org/10.12785/ijcds/100128

A SMOTe based Oversampling Data-Point Approach to Solving the Credit Card Data Imbalance Problem in Financial Fraud Detection

Nhlakanipho Mqadi1, Nalindren Naicker1 and Timothy Adeliyi1

1 ICT and Society Research Group, Durban University of Technology, Durban, South Africa
E-mail: [email protected], [email protected], [email protected]

Received 16 Aug. 2020, Revised 15 Sep. 2020, Accepted 25 Dec. 2020, Published 8 Feb. 2021

Abstract: Credit card fraud has negatively affected the market economic order and broken the confidence and interest of stakeholders, financial institutions, and consumers. Losses from card fraud are increasing every year, with billions of dollars being lost. Machine Learning methods use large volumes of data as examples for learning to improve the performance of classification models. Financial institutions use Machine Learning to identify fraudulent patterns in large amounts of historical financial records. However, the detection of credit card fraud remains a significant challenge for business intelligence technologies because most datasets containing credit card transactions are highly imbalanced. To overcome this challenge, this paper proposes the use of the data-point approach in machine learning. An experimental study was conducted applying Oversampling with SMOTe, a data-point approach technique, on an imbalanced credit card dataset. State-of-the-art classical machine learning algorithms, namely the Support Vector Machine, Logistic Regression, Decision Tree and Random Forest classifiers, were used to perform the classifications, and the accuracy was evaluated using the precision, recall, F1-score, and average precision metrics. The results show that if the data is highly imbalanced, the model struggles to detect fraudulent transactions; after using the SMOTe based Oversampling technique, there was a significant improvement in the ability to predict positive classes.

Keywords: Data Imbalance; Fraud Detection; Machine Learning; Oversampling

1. INTRODUCTION

In 2015, losses from card fraud reached approximately $21.84 billion, and by 2020, card fraud across the world was expected to reach nearly $32 billion [1]. Card fraud has negatively affected the market economic order and broken the confidence and interest of stakeholders, financial institutions, and consumers. The ability to detect fraud mitigates the risk of fraudulent activities and financial losses [2]. Machine learning (ML) is the science of designing and applying algorithms that are able to learn patterns from historic data [3]. According to Jiang et al. in [3], one aspect of ML is the ability of systems to recognize and classify the classes present in data. ML methods use large volumes of data as examples for learning. A collection of data instances is referred to as a dataset, and machine learning methods use two sets of data to learn: a training dataset and a testing dataset [4]. The introduction of ML has enabled financial institutions to use historical credit card data to learn patterns with the aim of distinguishing between fraudulent and legitimate transactions [5]. However, existing methods are not sufficient in real-world situations. Adewumi and Akinyelu in [6] stated that, in real life, the number of legitimate transactions recorded heavily outweighs the fraudulent transactions. This outweighing is known as class imbalance and, as a result, most card fraud detection techniques are still incapable of achieving ideal fraud detection abilities [6]. In consequence, the detection of credit card fraud remains a significant challenge for business intelligence technologies, as most datasets containing credit card transactions are highly imbalanced.

This study was conducted to investigate whether the data-point approach can help reduce the impact of the class imbalance problem. In this paper, the case where the majority class (legitimate transactions) dominates the minority class (fraudulent transactions), causing machine learning classifiers to be biased towards the majority class, is referred to as imbalanced data. Imbalanced data and bias are among the major problems in the field of data mining and machine learning, as most ML algorithms assume that data is equally distributed [7].



The failure to handle imbalanced data compromises the integrity and predictive abilities of machine learning systems, resulting in high financial impact. The data-point level approach consists of techniques for re-sampling the data in order to deal with imbalanced classes. These techniques include oversampling, under-sampling, and feature selection [8]. The aim of this paper was to assess the precision, recall, and F1 score of ML algorithms before and after the application of the data-point technique. The scope of this paper covers the investigation of an ML model's predictive accuracy with an imbalanced credit card dataset. The term accuracy is defined as the percentage of correctly classified instances: (TP + TN) / (TP + TN + FP + FN), where TP, FN, FP and TN represent the number of true positives, false negatives, false positives and true negatives, respectively. Predictive accuracy refers to the ability to classify legitimate and fraudulent transactions successfully.

2. RELATED WORK

Many studies [9-11] have reviewed and compared existing financial fraud detection models to identify the method with the best performance. Patil et al. in [10] used the confusion matrix and found that the Random Forest model performed better than Logistic Regression and Decision Tree in terms of the accuracy, precision and recall parameters, whereas Albashrawi in [11] found that the Logistic Regression model appeared to be the leading machine learning technique for detecting financial fraud. Other researchers [12-13] have proposed using a hybrid approach. These approaches show some improvements on the existing methods and recognize strengths of fraud detection models; for example, Chouiekha et al. in [14] found that Deep Learning algorithms such as the Convolutional Neural Network (CNN) technique achieve better accuracy than traditional machine learning algorithms. Rekha et al. in [15] presented a comparison of the performance of several boosting and bagging techniques on imbalanced datasets. According to Rekha et al. in [15], the oversampling technique takes all minority samples in the training data into consideration while performing classification; however, the presence of noise (in the minority and majority samples) degrades classification performance. The study proposed noise filtering using boosting and bagging. The performance was evaluated against state-of-the-art ensemble learning methods such as AdaBoost, RUSBoost, SMOTEBoost, Bagging, OverBagging, and SMOTEBagging on 25 imbalanced binary-class datasets with various Imbalance Ratios (IR). The experimental results showed that their approach is promising and effective for dealing with imbalanced datasets, using metrics like F-Measure and AUC.

Bauder and Khoshgoftaar in [16] focused on recognizing the fraudulent activities of Medicare Part B (i.e., medical insurance) providers, which comprised falsified actions such as the exploitation of patients and billing for non-rendered services. Providers and individuals that have been expelled from partaking in Federal healthcare programmes in the United States committed this fraud. The study discusses the processing of the Part B dataset and proposed a novel fraud label mapping method using the providers that have been recognized as fraudulent. The dataset was labelled and extremely imbalanced, with only a small number of cases flagged as fraud. Seven class distributions were generated from the dataset and their behaviors were evaluated using six ML techniques, in the interest of fighting the class imbalance problem while also achieving good fraud identification performance. The findings revealed that the learner with the best Area Under the ROC Curve (AUC) score of 0.87302 was RF100 using a class distribution of 90:10. In addition, learners using a more balanced class distribution, such as the 50:50 distribution, produced less favourable results. The study concluded that keeping more of the dominant class improved the ability to detect Medicare Part B fraud.

Similarly, Liu et al. in [17] conducted an experiment proposing two algorithms to overcome the deficiency of using under-sampling to handle the class imbalance problem, namely that when under-sampling is applied, many majority class examples are ignored. The study proposed EasyEnsemble and BalanceCascade. EasyEnsemble divides the majority class into several smaller chunks, the chunks are independently used to train learners, and at the end the outputs of all the learners are combined. BalanceCascade uses a sequential training-based approach, wherein, in each sequence, the correctly classified examples of the majority class are eliminated from further evaluation in the next sequence. The findings showed that, compared to many existing methods, both EasyEnsemble and BalanceCascade achieve higher F-measure, G-mean, and AUC values, and the training time was found to be close to that of under-sampling, which, according to Liu et al. in [17], was significantly faster than other approaches.

A paper by Ebenuwa et al. in [18] presented Variance Ranking (VR), a feature selection-based method for solving the problem of datasets with imbalanced classes. The work involved data from four databases, namely the Wisconsin Breast Cancer dataset, the Pima Indians Diabetes dataset, the Cod-RNA dataset, and the BUPA liver disorders dataset. The Information Gain Technique (IGT) and The Pearson Correlation (TPC), two popular feature selection methods, were used to compare the results of VR using a novel comparison technique called Ranked Order Similarity (ROS). The decision tree, logistic regression, and support vector machine were used to train the classifiers, and it was found that the proposed method performed better than the benchmarks used in the experiment.


While there have been many studies on financial fraud detection, class imbalance problems, and classification algorithms using machine learning, the overwhelming conclusion is that misclassification of fraud and non-fraud transactions continues to be a persistent problem when the dataset is imbalanced. There has been little research to find the best combination of the data-point approach with a classification algorithm to address class imbalance in credit card fraud. To investigate this problem, the study examined four well-known ML fraud identification algorithms with an imbalanced credit card fraud dataset to determine whether using the Oversampling method based on the Synthetic Minority Oversampling Technique (SMOTe) improves the predictive accuracy. The performance of the credit card fraud identification models was then analyzed using standard performance metrics. This paper provides an intensive comparative and statistical analysis of the prediction results.

The remainder of this paper is structured as follows: section 3 discusses the Research Methodology, and section 4 presents the experimental results, discussion, and conclusion of the study.

3. RESEARCH METHODOLOGY

An experimental study was conducted applying oversampling, one of the data-point level techniques, to demonstrate the effect of handling class imbalance on the credit card dataset. An experimental research design is most suitable where the independent variable is manipulated and its effect is tested on the dependent variable [19]. An experimental design was therefore suitable for this study, which investigates the predictive accuracy of machine learning models for fraud identification before and after manipulation using Oversampling to handle imbalanced data on the credit card dataset.

A. Classifications
The Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF) algorithms were selected for the experiment. The algorithms were used to train and test the fraud detection model following the train, test, and predict approach. The Support Vector Machine is a classification algorithm that uses supervised learning to distinguish normal and fraud classes [20]; SVM creates a hyperplane to segregate the transactions by grouping them on either side of the hyperplane as normal or fraud, respectively. Logistic Regression is a statistical classifier used to allocate observations to a discrete set of classes. The classification is transformed using the logistic sigmoid function to return a probability value, which can then be mapped to either normal or fraud; Logistic Regression predictions allow only specific categories or values [21]. Decision Tree is a method for building a classifier from training data: the classifier creates a tree-like structure where the leaves symbolize the classifications, the non-leaf nodes symbolize features, and the branches symbolize combinations of features that lead to the classifications [22]. The Random Forest algorithm is a supervised learning classifier for regression and classification. The ensemble technique is made up of numerous decision trees; during the experiment, the forest was made of 600 trees. According to Jiang et al. in [23], the trees each produce a class prediction, and the class with the most occurrences becomes the final prediction of the classifier. The four classification algorithms are used with the data-point approach to find the best combination and strategy for solving the class imbalance problem in credit card fraud detection.

B. Dataset
The experiment was conducted using a credit card dataset from Kaggle, found at https://www.kaggle.com/mlg-ulb/creditcardfraud/home. The dataset comprises European cardholders' transactions, with 492 frauds out of a sample size of 284807 transactions. The minority class, recorded as actual fraud cases in the dataset, made up only 0.172% of all transactions:

(fraud / sample size) * 100 = fraud cases (%)    (1)

There are 31 features in the dataset. Features V1, V2, up to V28 are the principal components obtained through a Principal Component Analysis (PCA) conversion applied due to issues of confidentiality; the only features not converted with PCA are 'Time', 'Amount', and 'Class'. The 'Class' feature contains a numeric value of 0 to indicate a normal transaction and 1 to indicate fraud. The dataset was chosen because it is labelled, highly imbalanced, and convenient to the researcher: it is easily accessible, making it suitable for the requirements of this experiment.

Figure 1. Original Transaction Class Distribution

Fig. 1 shows a bar graph representation of the frequency of the normal (legitimate) classes versus the fraud classes. A dataset is imbalanced if at least one of the classes constitutes only a very small minority; the bar for the fraud class is almost invisible in Figure 1. An imbalanced dataset is best evaluated with sensitivity metrics, whereas a balanced dataset is best evaluated using the standard accuracy score [24]. In this paper, we observed both sensitivity and standard performance metrics on both the balanced and imbalanced datasets to ensure a fair comparison and to gain an in-depth understanding of the ability to predict the positive and negative classes of the credit card fraud dataset.
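To make the class ratio of equation (1) concrete, the following minimal sketch (assuming the creditcard.csv file downloaded from the Kaggle link above, and pandas as the reader; the file path is an assumption) reproduces the check:

    import pandas as pd

    # Load the Kaggle credit card dataset; the local file name is an assumption.
    df = pd.read_csv("creditcard.csv")

    # 'Class' is 0 for a normal transaction and 1 for fraud.
    fraud = int((df["Class"] == 1).sum())
    sample_size = len(df)

    # Equation (1): (fraud / sample size) * 100 = percentage of fraud cases.
    print(f"{fraud} frauds out of {sample_size} transactions "
          f"({fraud / sample_size * 100:.3f}% of all transactions)")
    # Expected for this dataset: 492 frauds out of 284807 transactions (~0.172%).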


Figure 2. Amount per transaction by class

Fig. 2 provides a visual representation of the amount in dollars that the fraudulent transactions account for in the dataset versus the legitimate transactions. The fraudulent amounts fall within the range of the majority of the normal amounts, which makes it difficult to use the amount as a parameter to distinguish between the classes of transactions.

Figure 3. Time of transaction vs Amount by class

The plot in Fig. 3 shows a visual representation of how often fraudulent versus legitimate transactions occur during certain periods.

C. The Data-Point Approach
This study investigated using the data-point approach to solve the data imbalance problem. The data-point level approach consists of interventions on the data to alleviate the effect of the class imbalance, and it has the flexibility to be used with the latest algorithms such as support vector machines, decision tree, and logistic regression, as stated by Hassib in [25]. Our paper used the Oversampling technique to investigate the effect of the data-point method. Oversampling refers to increasing the count of the minority class to balance with the majority class. According to Somasundaran & Reddy in [26], this method tends to duplicate the data already available or generate data based on the available data. Oversampling attempts to balance the dataset by adding to the number of minority-class instances. The objective of using oversampling is to avoid losing samples of the majority class, because that could result in losing some valuable data. Instead, new samples of the minority classes are produced using methods such as SMOTe, bootstrapping, and repetition [27].

The experiment was conducted using SMOTe, which is a method based on nearest neighbours, judged by Euclidean Distance amongst data points within a feature space. The number of artificial samples to be produced is indicated by a percentage passed as a parameter, and this percentage is always a multiple of 100 [28]. An oversampling percentage of 100 will create a new sample for each minority instance, therefore doubling the total count of the minority class in the dataset; for example, the 492 fraud cases would become 984. Likewise, an oversampling percentage of 200 would triple the total count of the minority class. In SMOTe:

• The k nearest neighbours are established for each instance of the minority class, given that they belong to the same class, with
(SMOTe %) / 100 = k    (2)
• The difference between the feature vector of the considered instance and the feature vectors of the k nearest neighbours is found, so k difference vectors are obtained.
• Each of the k difference vectors is multiplied by a random number in the range of 0 to 1 (exclusive of 0 and 1).
• Lastly, at each repetition, the product of the random number and the difference vector is added to the feature vector of the original minority instance.
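As an illustration only, the four steps above can be sketched in Python with NumPy and scikit-learn. The function name, the assumption of 100% oversampling (one synthetic sample per instance), and the seed are all choices made for this sketch; it is not the imblearn implementation used in the experiment:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sketch(minority, k=5, seed=42):
        """Generate one synthetic sample per minority instance (100% oversampling).

        minority: 2-D NumPy array of minority-class feature vectors.
        """
        rng = np.random.default_rng(seed)
        # Step 1: k nearest neighbours of each minority instance, searched
        # within the minority class only (Euclidean distance).
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        _, idx = nn.kneighbors(minority)          # column 0 is the point itself
        synthetic = []
        for i, x in enumerate(minority):
            neighbour = minority[rng.choice(idx[i, 1:])]
            # Step 2: difference vector between the instance and a neighbour.
            diff = neighbour - x
            # Steps 3 and 4: scale by a random number in (0, 1) and add it
            # back to the feature vector of the original minority instance.
            synthetic.append(x + rng.uniform(0, 1) * diff)
        return np.asarray(synthetic)

Because each synthetic point lies on the line segment between a real minority point and one of its neighbours, the new samples interpolate the minority region rather than duplicating existing rows.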
Resampling using SMOTe was implemented by importing and inheriting from the imblearn library to reduce development time. The implementation was conducted by calling the SMOTe method and passing parameters. Using inheritance allowed the researcher to reuse existing code, reduce programming time, increase efficiency, and allow flexibility. Table 1 below shows the parameters and the values used during the experiments [29].


Table 1. SMOTe method call parameters [29]

Parameter           Value
Sampling Strategy   Auto
Random State        None
K Neighbours        5
M Neighbours        Deprecated
Out Step            Deprecated
Kind                Deprecated
SVM Estimator       Deprecated
N Jobs              1
Ratio               None

During the experiment, different combinations of the parameters were investigated to find a combination that produced the ideal results. SMOTe represents an improvement over Random Oversampling, in which the minority class is oversampled by duplication, resulting in sub-optimal performance [30-31]. However, Douzas et al. in [32] stated that, in highly imbalanced datasets, too much Oversampling might result in overfitting. To combat this issue, we used the data-point approach with SMOTe to interpolate the existing dataset and generate new instances. This approach aims at eliminating both between-class and within-class imbalances, hence avoiding the generation of random samples.
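A minimal sketch of the resulting call, using the imblearn SMOTE class with the Table 1 values (X and y are the feature matrix and label vector described in the Experiment subsection below; in recent imblearn versions the deprecated parameters are simply omitted):

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # Table 1 values: sampling_strategy='auto' and k_neighbors=5 are the
    # defaults; random_state is left configurable (Section 4.B reports 42).
    smote = SMOTE(sampling_strategy="auto", k_neighbors=5, random_state=None)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    # After resampling, the two classes should be evenly balanced.
    print(Counter(y_resampled))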
D. Experiment
The study used the Python programming language and Google Colab. Python offers succinct, human-readable code and a wide range of libraries and frameworks for implementing ML algorithms, which reduces development time and makes it more suitable for this study. The code was executed on the Google Colab notebook, which executes code on Google's cloud servers, leveraging the power of Google hardware, including Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), from a web browser. The first step was to import all the libraries. Once all the libraries were imported, the creditcard.csv dataset was uploaded. The dataset was validated to ensure that there were no null values or missing columns. An exploratory data analysis was performed to visualize and gain insight into the data. We then identified the independent and dependent features. The dependent feature was stored separately in the Y variable and the independent features were stored in the X variable. The Y variable was the column containing the indicator of whether the transaction was normal (labelled as 0) or fraud (labelled as 1), which was the variable we were trying to predict. The next step was to split the data into a training set and a testing set using a class from the sklearn library to call the train-test-split function. The train-test-split function accepts the independent variable X, the dependent variable Y, and the test size. The test-size parameter specifies the ratio used to split the original dataset: 70% of the original dataset was used to train the model and 30% of the dataset was used to test the model. The next phase of the experiment was to build and train our model. We used each of the selected algorithms discussed in the classifications section of the research methodology. We fit each model with the x-train and y-train training data. We then used the x-test data to try to predict the y-test variable.
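The steps described above can be summarized in a short sketch. The 70/30 split follows the text, while the individual model settings (for example, a linear SVM, since the kernel is not stated, and 600 trees for the Random Forest per Section 3.A) are assumptions:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("creditcard.csv")
    X = df.drop(columns="Class")    # independent features
    y = df["Class"]                 # dependent feature: 0 = normal, 1 = fraud

    # 70% of the dataset for training, 30% for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=42)

    models = {
        "SVM": LinearSVC(),                              # kernel is an assumption
        "LR": LogisticRegression(max_iter=1000),
        "DT": DecisionTreeClassifier(),
        "RF": RandomForestClassifier(n_estimators=600),  # 600 trees, Section 3.A
    }
    predictions = {}
    for name, model in models.items():
        model.fit(X_train, y_train)                      # fit on x-train, y-train
        predictions[name] = model.predict(X_test)        # predict the y-test variable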


4. RESULTS AND DISCUSSION

A. Results
Once the experiment was concluded, we compared the y-test variable to the prediction results to generate a classification report. Precision measures the ability of a model to predict the positive class. Recall describes how good the model is at predicting the positive class when the actual outcome is positive. A precision-recall curve is a plot of the precision (y-axis) against the recall (x-axis) for different thresholds. Askari in [33] stated that using both recall and precision is valuable for measuring the predictive strength of a model in situations where the distribution between two classes is imbalanced. The F1 score is the accuracy measurement of the test; both the precision and the recall scores are considered when calculating it. The initial results were achieved with the imbalanced dataset. The sensitivity performance metrics used to evaluate the imbalanced dataset results are:

Precision = TP / (TP + FP)    (3)

Recall = TP / (TP + FN)    (4)

F1 = 2 * (precision * recall) / (precision + recall)    (5)

The Average Precision (AP) is a score computed from the prediction scores. AP summarizes a precision-recall curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight [34]:

AP = Σn (Rn − Rn−1) Pn    (6)

The above formula computes the average precision, where Pn and Rn are the precision and recall at the nth threshold. Precision and recall are always between 0 and 1; therefore, AP also falls within 0 and 1. AP is a metric used to measure the accuracy of a classifier: the closer the number is to 1, the more accurate the classifier.
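Equations (3) to (6) map directly onto scikit-learn's metric helpers; a sketch, reusing the hypothetical predictions dictionary from the pipeline sketch above:

    from sklearn.metrics import classification_report, average_precision_score

    for name, y_pred in predictions.items():
        print(name)
        # Per-class Precision (3), Recall (4) and F1 (5).
        print(classification_report(y_test, y_pred, digits=2))
        # Average Precision (6); note that a smoother PR curve is obtained from
        # continuous scores (predict_proba / decision_function) than hard labels.
        print("AP =", round(average_precision_score(y_test, y_pred), 2))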
To present the results, zero (0) is used to represent legitimate transactions and one (1) to represent fraudulent transactions. The lowest possible value is represented by 0.00 (0%) and the highest possible value by 1.00 (100%). Table 2 below uses ALG for algorithm, C for Class, P for Precision, R for Recall, F1 for F1-score, and AC for accuracy. Table 2 compares the scores of all four classifiers (SVM, LR, DT, and RF) before Oversampling; it shows the initial classification report comparison for all the algorithms before the data-point level approach technique was applied to the credit card dataset.

Table 2. Comparison of imbalanced dataset classification before Oversampling

ALG   C   P      R      F1     AC
SVM   0   1.00   1.00   1.00   1.00
      1   0.00   0.00   0.00   AP = 0.00
LR    0   1.00   1.00   1.00   1.00
      1   0.42   0.47   0.44   AP = 0.46
DT    0   1.00   1.00   1.00   1.00
      1   0.58   0.65   0.61   AP = 0.38
RF    0   1.00   1.00   1.00   1.00
      1   0.90   0.53   0.67   AP = 0.48

The closer the curve is to the value of one in the upper right corner, the better the quality; if the curve leans towards the lower left corner, the quality of the classification is poor. Figs. 4, 5, 6, and 7 below show the precision-recall curves before Oversampling with SMOTe was applied. The curves represent the quality of each classifier with an imbalanced dataset.

Figure 4. SVM Precision-Recall curve

Fig. 4 shows the precision-recall curve of the SVM classification, where the average precision computed was 0.00. The SVM curve leans towards the lower left corner, which represents a classifier with poor performance. In our case, the SVM classifier performed worse than all the other classifiers; there was high bias towards the majority class.

Figure 5. Logistic Regression Precision-Recall curve

Fig. 5 shows the precision-recall curve of the Logistic Regression classification, where the average precision computed was 0.46.

Figure 6. Decision Tree Precision-Recall curve

Fig. 6 shows the precision-recall curve of the Decision Tree classification, where the average precision computed was 0.38.

Figure 7. Random Forest Precision-Recall curve


Fig. 7 shows the precision-recall curve of the Random Forest classification, where the average precision computed was 0.48.

B. Oversampling with SMOTe
The next step of the experiment was to use SMOTe to resample the original dataset. We used the default values for most of the parameters, except for the random-state. The random-state parameter controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features considered when looking for the best split at each node. After multiple iterations, the results presented were obtained using a random state of 42.
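A sketch of this resampling step, reusing the hypothetical names from the earlier sketches; note that the paper resamples the original dataset and then re-splits it, whereas applying SMOTe only to the training split is often recommended to keep synthetic points out of the test set:

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split

    # Resample the original dataset with the random state reported above.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

    # Re-run the split-test-train-predict cycle on the balanced data.
    X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal,
                                                        test_size=0.3,
                                                        random_state=42)
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions[name] = model.predict(X_test)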

Figure 8. Transaction Class Distribution after Oversampling

Fig. 8 shows the transaction class distribution after Oversampling. The cases are evenly balanced between normal and fraud. The dataset was then fed into the prediction model following the split-test-train-predict cycle. We compared the y-test to the prediction to generate a classification report after oversampling. Table 3 shows the classification report for all the algorithms after Oversampling was applied to mitigate the effect caused by class imbalance.

Table 3. Comparison of classifications after Oversampling

ALG   C   P      R      F1     AC
SVM   0   0.60   0.37   0.46   0.57
      1   0.55   0.76   0.64   AP = 0.53
LR    0   0.97   0.97   0.97   0.97
      1   0.97   0.97   0.97   AP = 0.96
DT    0   1.00   1.00   1.00   1.00
      1   1.00   1.00   1.00   AP = 1.00
RF    0   1.00   1.00   1.00   1.00
      1   1.00   1.00   1.00   AP = 1.00

Table 3 shows high precision, recall, F1-score and accuracy for the Decision Tree and Random Forest. Figs. 9, 10, 11 and 12 plot the respective precision-recall curves of the classifications after Oversampling with SMOTe. The goal is to observe whether the P-R curve moves towards the upper right corner of the chart, verifying that the accuracy has improved; the closer the curve is to the value of one on the y-axis, the better the quality.

Figure 9. SVM Precision-Recall curve, AP = 0.53

Fig. 9 shows the precision-recall curve of the SVM classification, where the average precision computed was 0.53.

Figure 10. LR Precision-Recall curve, AP = 0.96

Fig. 10 shows the precision-recall curve of the Logistic Regression classification, where the average precision computed was 0.96.

Figure 11. DT Precision-Recall curve, AP = 1.00

Fig. 11 shows the precision-recall curve of the Decision Tree classification, where the average precision computed was 1.00.

Figure 12. RF Precision-Recall curve, AP = 1.00

Fig. 12 shows the precision-recall curve of the Random Forest classification, where the average precision computed was 1.00. A P-R curve is a great way to provide a graphical visualization of the quality of a classifier. A P-R curve that is a straight line towards the upper right corner, such as those of Fig. 11 and Fig. 12, represents the best possible quality. The two P-R curves tell us that the classifiers were able to predict the positive classes with 100% accuracy.

C. Discussion
This section provides a discussion of the classification results before and after using oversampling, highlighting the improvements observed and ranking the algorithms. The classification report with the original dataset revealed that the Random Forest model was the best performer, with a precision score of 0.90; however, the Recall was 0.53, so cross-validation shows that the precision score is misleading. To further validate the model, the computed average precision was 0.48, revealing that the model was not producing the ideal performance and that further improvements were necessary.

The SVM model was the worst performing, with a precision score of 0.00 for fraud. The score of 0.00 means that the SVM model failed to identify fraud cases with imbalanced data. All the algorithms scored 1.00 for legitimate cases, which means that, due to the imbalance level, the majority class was completely dominant. To determine whether there was any improvement in fraud detection, the following formula calculated the improvement for the Precision, Recall, and F1 scores:

score_after − score_before = % improvement    (7)
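Equation (7) is a simple per-class difference; as a worked example, the positive-class precision values from Tables 2 and 3 reproduce the improvements reported below:

    # Positive-class (fraud) precision before (Table 2) and after (Table 3) SMOTe.
    before = {"SVM": 0.00, "LR": 0.42, "DT": 0.58, "RF": 0.90}
    after = {"SVM": 0.55, "LR": 0.97, "DT": 1.00, "RF": 1.00}

    # Equation (7): score_after - score_before, in percentage points.
    for alg in before:
        print(f"{alg}: +{(after[alg] - before[alg]) * 100:.0f}%")
    # Output: SVM: +55%, LR: +55%, DT: +42%, RF: +10%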
After using the SMOTe Oversampling technique, the Precision score for the positive class improved by 55% for SVM, 55% for Logistic Regression, 42% for the Decision Tree, and 10% for Random Forest.

The Recall score shows that the strength of identifying True Positives (actual fraudulent cases) improved by 76% for SVM, 50% for Logistic Regression, 47% for Random Forest, and 39% for the Decision Tree for the positive class.

The results reveal that the F1-score improved by 64% for SVM, 53% for Logistic Regression, 35% for the Decision Tree, and 33% for Random Forest for the positive class. Comparing the F1 scores shows that the ability to detect positive classes was improved.

An interesting observation was that the classification of the negative class for the Logistic Regression, Decision Tree and Random Forest algorithms was good and consistent throughout the experiment. SVM performed well initially, with an overall accuracy score of 100%; however, after using Oversampling, the score dropped to 47%, meaning that even though the ability to recognize positive classes improved, the ability to recognize negative classes degraded. Therefore, SVM is not an ideal solution for credit card fraud detection.

Based on the results, the Random Forest algorithm is the leading algorithm. The algorithms, ranked from best, are in the following order: Random Forest, Decision Tree, Logistic Regression, and SVM.

D. Conclusion
The results show that if the data is highly imbalanced, the model struggles to detect fraudulent transactions. After using the SMOTe based Oversampling technique, which is a data-point approach, there was a significant improvement in the ability to predict positive classes. Based on the findings, the Random Forest and Decision Tree algorithms produced the best performance with the credit card dataset.


Future research can perform cross-validation or a comparison across multiple datasets to verify the consistency of the data-point approach in handling imbalanced credit card fraud datasets. Further studies can investigate building and deploying a real-time solution that can detect fraud as and when the transaction is occurring.

ACKNOWLEDGMENT

Kind acknowledgment to the Durban University of Technology for providing the resources for this research study.
REFERENCES
[1] D. Robertson, The Nelson Report, October. Available online: http://www.nelsonreport.com/upload/content_promo/The_Nelson_Report_10-17-2016.pdf (accessed on 03 February 2019).
[2] D. Huang, D. Mu, L. Yang, and X. Cai, CoDetect: Financial Fraud Detection with Anomaly Feature Detection. National Natural Science Foundation of China, vol. 6, no. 2, pp. 19161-19174, 2018. DOI: 10.1109/ACCESS.2018.2816564
[3] C. Jiang, J. Song, G. Liu, L. Zheng, and W. Luan, Credit Card Fraud Detection: A Novel Approach Using Aggregation Strategy and Feedback Mechanism. IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3637-3647, 2018. DOI: 10.1109/JIOT.2018.2816007
[4] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 25-36, 2006.
[5] A. O. Adewumi and A. A. Akinyelu, A survey of machine-learning and nature-inspired based credit card fraud detection techniques. Int J Syst Assur Eng Manag, vol. 8, no. 2, pp. 937-953, 2017. DOI: https://doi.org/10.1007/s13198-016-0551-y
[6] Y. Bian, M. Cheng, C. Yang, Y. Yuan, Q. Li, J. L. Zhao, et al., Financial fraud detection: a new ensemble learning approach for imbalanced data. PACIS 2016 Proceedings, vol. 315, no. 1, pp. 1-11, 2016.
[7] A. T. Elhassan, M. Aljourf, F. Al-Mohanna, and M. Shoukri, Classification of Imbalance Data using Tomek Link (T-Link) Combined with Random Under-sampling (RUS) as a Data Reduction Method. Global J Technol Optim, vol. 1, no. 1, pp. 1-11, 2017. DOI: 10.4172/2229-8711.S1111
[8] K. Sotiris, K. Dimitris, and P. Panayiotis, Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 1-12, 2016.
[9] M. Zanin, M. Romance, S. Moral, and R. Criado, Credit card fraud detection through parenclitic network analysis. IEEE Access, vol. 1, no. 1, pp. 1-8, 2017. DOI: https://doi.org/10.1155/2018/5764370
[10] S. Patil, V. Nemade, and P. Kumar, Predictive Modelling for Credit Card Fraud Detection Using Data Analytics. Computational Intelligence and Data Science, vol. 132, no. 1, pp. 385-395, 2018. DOI: https://doi.org/10.1016/j.procs.2018.05.199
[11] M. Albashrawi, Detecting Financial Fraud Using Data Mining Techniques: A Decade Review from 2004 to 2015. Journal of Data Science, vol. 14, no. 1, pp. 553-570, 2016.
[12] R. A. Bauder and T. M. Khoshgoftaar, The effects of varying class distribution on learner behavior for Medicare fraud detection with imbalanced big data. Health Information Science and Systems, vol. 6, no. 9, pp. 1-14, 2018.
[13] K. Randhawa, C. K. Loo, M. Seera, C. P. Lim, and A. K. Nandi, Credit Card Fraud Detection Using AdaBoost and Majority Voting. IEEE Access, vol. 6, no. 1, pp. 14277-14284, 2018.
[14] A. Chouiekha, E. Hassane, and E. Haj, ConvNets for Fraud Detection analysis. Procedia Computer Science, vol. 127, no. 1, pp. 133-138, 2018. DOI: https://doi.org/10.1016/j.procs.2018.01.107
[15] G. Rekha, A. K. Tyagi, and R. V. Krishna, Solving Class Imbalance Problem Using Bagging, Boosting Techniques, with and without Using Noise Filtering Method. International Journal of Hybrid Intelligent Systems, vol. 15, no. 2, pp. 67-76, 2019. DOI: 10.3233/HIS-190261
[16] R. A. Bauder and T. M. Khoshgoftaar, The effects of varying class distribution on learner behaviour for Medicare fraud detection with imbalanced big data. Health Information Science and Systems, vol. 6, no. 9, pp. 1-14, 2018. DOI: 10.1007/s13755-018-0051-3
[17] X. Liu, J. Wu, and Z. Zhou, Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, vol. 39, no. 2, pp. 539-550, 2009. DOI: 10.1.1.309.1465
[18] S. H. Ebenuwa, S. Sharif, and M. Alazab, Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data. IEEE Access, vol. 7, no. 1, pp. 24649-24666, 2019.
[19] L. S. Feldt, A comparison of the precision of three experimental designs employing a concomitant variable. Psychometrika, vol. 23, no. 1, pp. 335-353, 1958. DOI: https://doi.org/10.1007/BF02289783
[20] E. Lejon, P. Kyosti, and J. Lindstrom, Machine learning for detection of anomalies in press-hardening: Selection of efficient methods. Process IT Innovations R&D Centre, vol. 1, no. 1, pp. 1079-1083, 2018. DOI: https://doi.org/10.1016/j.procir.2018.03.221
[21] G. Baader and H. Krcmar, Reducing false positives in fraud detection: Combining the red flag approach with process mining. International Journal of Accounting Information Systems, vol. 31, no. 1, pp. 1-16, 2018. DOI: https://doi.org/10.1016/j.accinf.2018.03.004
[22] I. Sadgali, N. Sael, and F. Benabbou, Performance of machine learning techniques in the detection of financial frauds. Procedia Computer Science, vol. 148, no. 1, pp. 45-54, 2018. DOI: https://doi.org/10.1016/j.procs.2019.01.007
[23] K. Jiang, J. Lu, K. Xia, and L. Zheng, A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE. Arab J Sci Eng, vol. 41, no. 1, pp. 3155-3266, 2016. DOI: https://doi.org/10.1007/s13369-016-2179-2
[24] N. Malini and M. Pushpa, Analysis on Credit Card Fraud Identification Techniques based on KNN and Outlier Detection. Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), vol. 1, no. 1, pp. 1-12, 2017. DOI: 10.1109/AEEICB.2017.7972424
[25] E. M. Hassib, A. I. El-Desouky, E. M. El-Kenawy, and S. M. Ghamrawy, Imbalanced Big Data Mining Framework for Improving Optimization Algorithms Performance. IEEE Access, vol. 7, no. 1, pp. 170774-170795, 2019.
[26] A. Somasundaran and U. S. Reddy, Data Imbalance: Effects and Solutions for Classification of Large and Highly Imbalanced Data. Proc. of 1st International Conference on Research in Engineering, Computers and Technology, vol. 25, no. 10, pp. 28-34, 2016.
[27] A. D. Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi, Credit Card Fraud Detection: A Realistic Modelling and a Novel Learning Strategy. IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3784-3797, 2018. DOI: 10.1109/TNNLS.2017.2736643
[28] A. Mubalik and E. Adali, Multilayer Perception Neural network technique for fraud detection. Computer Science and Engineering (UBMK), vol. 1, no. 1, pp. 383-387, 2017.


[29] Imbalanced-learn. Available online: https://imbalanced-learn.readthedocs.io/en/stable (accessed on 15 June 2020).
[30] K. Jiang, J. Lu, and K. Xia, A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE. Arabian Journal for Science and Engineering, vol. 41, no. 8, pp. 3255-3266, 2016.
[31] A. Agrawal, H. L. Viktor, and E. Paquet, SCUT: Multi-class imbalanced data classification using SMOTE and cluster-based undersampling. In 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K), IEEE, vol. 1, pp. 226-234, 2015.
[32] G. Douzas, F. Bacao, and F. Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences, vol. 465, no. 1, pp. 1-20, 2018.
[33] S. Askari and A. Hussain, Credit Card Fraud Detection Using Fuzzy ID3. Computing, Communication and Automation (ICCCA), vol. 40, no. 1, pp. 446-452, 2017. DOI: 10.1109/CCAA.2017.8229897
[34] J. West and M. Bhattacharya, Some Experimental Issues in Financial Fraud Mining. Procedia Computer Science, vol. 80, no. 1, pp. 1734-1744, 2016.

Mr. Nhlakanipho M. Mqadi was born in Durban, South Africa. He is currently a student working towards a master's degree in Information and Communication Technology at the Durban University of Technology (DUT). He obtained a Bachelor of Technology degree in Information Technology (IT) (cum laude) and a three-year National Diploma in IT, both from the Durban University of Technology. He is a member of the Golden Key International Honour Society.

Dr. N. Naicker's education background is as follows: PhD [Information Systems & Technology]; MSc [Information Systems]; Hons BSc (Computer Science); BSc (Computer Science). He currently serves as head of the Information Systems Department at the Durban University of Technology and is involved with the supervision of PhD and Masters students at the Department of Information Systems. He is a member of the ICT and Society Research Group for the Faculty of Accounting and Informatics at the Durban University of Technology.

Dr. T. Adeliyi is an academic in the Department of Information Technology at the Durban University of Technology. He is an active researcher in the field of computer science, with research interests in machine learning, digital image processing and intelligent systems. He is a member of the ICT and Society Research Group of the Durban University of Technology.
