A SMOTe Based Oversampling Data-Point Approach To Solving The Credit Card Data Imbalance Problem in Financial Fraud Detection 2020
ISSN (2210-142X)
Int. J. Com. Dig. Sys. 10, No.1 (Feb-2021)
http://dx.doi.org/10.12785/ijcds/100128
Received 16 Aug. 2020, Revised 15 Sep. 2020, Accepted 25 Dec. 2020, Published 8 Feb. 2021
Abstract: Credit card fraud has negatively affected the market economic order and broken the confidence and interest of stakeholders, financial institutions, and consumers. Losses from card fraud are increasing every year, with billions of dollars being lost. Machine Learning methods use large volumes of data as examples for learning to improve the performance of classification models. Financial institutions use Machine Learning to identify fraudulent patterns from large amounts of historical financial records. However, the detection of credit card fraud remains a significant challenge for business intelligence technologies, as most datasets containing credit card transactions are highly imbalanced. To overcome this challenge, this paper proposes the use of the data-point approach in machine learning. An experimental study was conducted applying Oversampling with SMOTe, a data-point approach technique, on an imbalanced credit card dataset. State-of-the-art classical machine learning algorithms, namely Support Vector Machines, Logistic Regression, Decision Tree and Random Forest classifiers, were used to perform the classifications, and performance was evaluated using precision, recall, F1-score, and the average precision metrics. The results show that if the data is highly imbalanced, the model struggles to detect fraudulent transactions. After using the SMOTe-based Oversampling technique, there was a significant improvement in the ability to predict positive classes.
failure to handle imbalance data compromises the integrity and predictive abilities of machine learning systems, resulting in high financial impact. The data-point level approach consists of techniques for re-sampling the data in order to deal with imbalanced classes. These techniques include oversampling, under-sampling, and feature selection [8]. The aim of this paper was to assess the precision, recall, and F1 score of ML algorithms before and after the application of the data-point technique. The scope of this paper covers the investigation of ML models' predictive accuracy with an imbalanced credit card dataset. The term accuracy can be defined as the percentage of correctly classified instances, (TP + TN) / (TP + TN + FP + FN), where TP, FN, FP and TN represent the number of true positives, false negatives, false positives and true negatives, respectively. Predictive accuracy refers to the ability to classify legitimate and fraudulent transactions successfully.

2. RELATED WORK

Many other studies [9-11] reviewed and compared the existing financial fraud detection models to identify the method with the best performance. Patil et al. in [10] used the confusion matrix and found that the Random Forest model performed better than Logistic Regression and Decision Tree in terms of accuracy, precision and recall, whereas Albashrawi in [11] found that the Logistic Regression model appeared to be the leading machine learning technique in detecting financial fraud. Other researchers [12-13] have proposed using a hybrid approach. These approaches show some improvements on the existing methods and recognize strengths of fraud detection models; for example, Chouiekha et al. in [14] found that Deep Learning algorithms such as Convolutional Neural Networks (CNN) have better accuracy than traditional machine learning algorithms. Rekha et al. in [15] presented a comparison of the performance of several boosting and bagging techniques on imbalanced datasets. According to Rekha et al. in [15], the Oversampling technique takes all minority samples in the training data into consideration while performing classification. However, the presence of some noise (in the minority samples and majority samples) degrades the classification performance. The study proposed noise filtering using boosting and bagging. The performance was evaluated against state-of-the-art methods based on ensemble learning, such as AdaBoost, RUSBoost, SMOTEBoost, Bagging, OverBagging, and SMOTEBagging, on 25 imbalanced binary class datasets with various Imbalance Ratios (IR). The experimental results show that their approach is promising and effective for dealing with imbalanced datasets using metrics like F-Measure and AUC.

Bauder and Khoshgoftaar in [16] focused on the ability to recognize the fraudulent activities of Medicare Part B (i.e., medical insurance) providers, which comprised falsified actions such as the exploitation of patients and the billing for non-rendered services. Providers and individuals that have been expelled from partaking in Federal healthcare programmes in the United States committed this fraud. The study discusses the processing of the Part B dataset and proposed a novel fraud label mapping method using the providers that have been recognized as fraudulent. The dataset was labelled and extremely imbalanced, with only a small number of cases flagged as fraud. Seven class distributions were generated from the dataset and their behaviours were evaluated using six ML techniques, in the interest of fighting the class imbalance problem while also achieving a good fraud identification performance. The findings revealed that the learner with the best Area Under the ROC Curve (AUC) score of 0.87302 was RF100 using a class distribution of 90:10. In addition, learners using a more balanced class distribution, such as 50:50, produced less favourable results. The study concluded that keeping more of the dominant class improved the ability to detect Medicare Part B fraud.

Similarly, Liu et al. in [17] conducted an experiment to propose two algorithms to overcome the deficiency of using under-sampling in handling the problem of class imbalance. The deficiency was that when under-sampling is applied, many majority class examples are ignored. Therefore, the study proposed EasyEnsemble and BalanceCascade. EasyEnsemble divides the majority class into several smaller chunks; the chunks are then independently used to train the learner and, at the end, all the outputs of the learners are combined. BalanceCascade uses a sequential training-based approach, wherein at each sequence the correctly classified examples of the majority class are eliminated from being further evaluated in the next sequence. The findings showed that, compared to many existing methods, both EasyEnsemble and BalanceCascade have higher F-measure, G-mean, and AUC values, and the training time was found to be close to that of under-sampling, which according to Liu et al. in [17] was significantly faster than other approaches.

A paper by Ebenuwa et al. in [18] presented Variance Ranking (VR), which is a feature selection-based method for solving the problem of datasets with imbalanced classes. The work involved data from four databases, namely the Wisconsin Breast Cancer dataset, the Pima Indians Diabetes dataset, the Cod-RNA dataset, and the BUPA liver disorders dataset. The Information Gain Technique (IGT) and The Pearson Correlation (TPC), two popular feature selection methods, were used to compare the results of VR using a novel comparison technique called the Ranked Order Similarity (ROS). The decision tree, logistic regression, and support vector machine were used to train the classifiers, and it was found that the proposed method performed better than the benchmarks used in the experiment.

While there have been many studies on financial fraud detection, class imbalance problems and classification algorithms using machine learning, the overwhelming
http://journals.uob.edu.bh
Int. J. Com. Dig. Sys. 10, No.1, 277-286 (Feb-2021) 279
conclusion is that misclassification of fraud and non-fraud transactions continues to be a persistent problem when the dataset is imbalanced. There has been little research to find the best combination of the data-point approach with the classification algorithm to address class imbalance in credit card fraud. To investigate this problem, the study examined four well-known ML fraud identification algorithms with an imbalanced credit card fraud dataset to determine whether using the Oversampling method based on the Synthetic Minority Oversampling Technique (SMOTe) improves the predictive accuracy. The performance of the credit card fraud identification models was then analyzed using standard performance metrics. This paper provides an intensive comparative and statistical analysis of the prediction results.

The remainder of this paper is structured as follows: Section 3 discusses the Research Methodology; Section 4 presents the experimental results, discussion, and conclusion of the study.

3. RESEARCH METHODOLOGY

An experimental study was conducted surveying oversampling, one of the data-point level techniques, to demonstrate the effect of handling class imbalance on the credit card dataset. An experimental research design is more suitable where the independent variable is manipulated and the effects are tested on the dependent variable [19]. An experimental design was more suitable for this study to investigate the predictive accuracy of machine learning models for fraud identification before and after manipulation using Oversampling to handle imbalanced data on the credit card dataset.

A. Classifications

The Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF) algorithms were selected for the experiment. The algorithms were used to train and test the fraud detection model following the train, test, and predict approach. Support Vector Machine is a classification algorithm that uses supervised learning to distinguish normal and fraud classes [20]. SVM creates a hyperplane to segregate the transactions by grouping them on either side of the hyperplane as normal or fraud respectively. Logistic Regression is a statistical classifier used to assign observations to a discrete set of classes. The output is transformed using the logistic sigmoid function to return a probability value, which can then be mapped to either normal or fraud. Logistic Regression predictions allow only specific categories or values [21]. Decision Tree is a method for building a classification tree from training data. The classifier creates a tree-like structure, where the leaves symbolize the classifications, the non-leaf nodes symbolize features, and the branches symbolize combinations of features that lead to the classifications [22]. The Random Forest algorithm is a supervised learning classifier for regression and classification. The ensemble technique is made up of numerous decision trees; during the experiment, the forest was made of 600 trees. According to Jiang et al. in [23], the trees each produce a class prediction and the class with the most occurrences becomes the final prediction of the classifier. The four classification algorithms are used with the data-point approach to find the best combination and strategy for solving the class imbalance problem in credit card fraud detection.

B. Dataset

The experiment was conducted using a credit card dataset hosted on Kaggle at https://www.kaggle.com/mlg-ulb/creditcardfraud/home. The dataset comprises European cardholders' transactions, with 492 frauds out of a sample size of 284807 transactions. The minority class, recorded as actual fraud cases in the dataset, made up only 0.172% of all transactions:

\[ \frac{\text{fraud}}{\text{sample size}} \times 100 = \text{fraud cases (\%)} \tag{1} \]

There are 31 features in the dataset. Features V1, V2, up to V28 are the principal components obtained through a Principal Component Analysis (PCA) transformation due to issues of confidentiality; the only features not transformed with PCA are 'Time', 'Amount', and 'Class'. The 'Class' feature contains a numeric value of 0 to indicate a normal transaction and 1 to indicate fraud. The dataset was chosen because it is labelled, highly imbalanced, and easily accessible, making it suitable for the requirements of this experiment.

Figure 1. Original Transaction Class Distribution

Fig. 1 shows a bar graph representation of the frequency of the normal (legitimate) classes versus fraud classes. A dataset is imbalanced if at least one of the classes constitutes only a very small minority. The bar for the fraud class is almost invisible in Figure 1. An imbalanced dataset is best evaluated with sensitivity metrics, whereas a balanced dataset is best evaluated
280 Nhlakanipho Mqadi, et. al.: A SMOTe based Oversampling Data-Point Approach to …
using the standard accuracy score [24]. In this paper, we observed both sensitivity and standard performance metrics on both the balanced and imbalanced datasets to ensure a fair comparison and to gain an in-depth understanding of the ability to predict the positive and negative classes of the credit card fraud dataset.

Figure 2. Amount per transaction by class

Fig. 2 provides a visual representation of the amount in dollars that the fraudulent transactions account for in the dataset versus the legitimate transactions. The amount is within the range of the majority of the normal amounts, which makes it difficult to use amount as a parameter to distinguish between the classes of transactions.

Figure 3. Time of transaction vs Amount by class

The plot in Fig. 3 shows a visual representation of how often fraudulent versus legitimate transactions occur during certain periods.

C. The Data-Point Approach

This study investigated using the data-point approach to solve the data imbalance problem. The data-point level approach consists of interventions on the data to alleviate the effect of the class imbalance, and it has the flexibility to be used with the latest algorithms such as support vector machines, decision tree, and logistic regression, as stated by Hassib in [25]. Our paper used the Oversampling technique to investigate the effect of the data-point method. Oversampling refers to increasing the count of the minority class to balance it with the majority class. According to Somasundaran & Reddy in [26], this method tends to duplicate the data already available or generate data based on available data. Oversampling attempts to balance the dataset by increasing the number of minority class samples. The objective of using oversampling is to avoid discarding samples of the majority class, because that could result in losing some valuable data. Instead, new samples of the minority class are produced using methods such as SMOTe, bootstrapping, and repetition [27].

The experiment was conducted using SMOTe, which is a method based on nearest neighbours judged by Euclidean distance amongst data points within a feature space. The number of artificial samples to be produced is indicated by a percentage passed as a parameter, and this percentage is always a multiple of 100 [28]. An Oversampling percentage of 100 will create one new sample for each minority instance, therefore doubling the total count of the minority class in the dataset; for example, the 492 fraud cases would become 984. Likewise, an oversampling percentage of 200 would triple the total count of the minority class. In SMOTe:

• The k nearest neighbours are established for each instance of the minority class, given that they belong to the same class.

\[ \frac{\text{SMOTe \%}}{100} = k \tag{2} \]

• The differences between the feature vector of the considered instance and the feature vectors of the k nearest neighbours are found, so k difference vectors are obtained.
• Each of the k difference vectors is multiplied by a random number between 0 and 1 (exclusive of 0 and 1).
• Lastly, at each repetition, the product of the random number and the difference vector is added to the feature vector of the original minority instance.

Resampling using SMOTe was implemented by importing and inheriting a library from imblearn to reduce development time. The implementation was conducted by calling the SMOTe method and passing parameters. Using inheritance allowed the researcher to reuse existing code to reduce programming time, increase efficiency and allow flexibility. Table 1 below shows the parameters and the values used during the experiments [29].
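The interpolation steps listed for SMOTe can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the imblearn implementation used in the experiment: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and the sketch follows the common formulation in which (percentage / 100) synthetic samples are created per minority instance, each interpolated towards one randomly chosen neighbour among the k nearest.

```python
import math
import random


def smote_sketch(minority, percentage=100, k_neighbours=5, seed=42):
    """Return synthetic minority samples built by interpolation (illustrative only)."""
    if percentage <= 0 or percentage % 100 != 0:
        raise ValueError("percentage is assumed to be a positive multiple of 100")
    if len(minority) < 2:
        raise ValueError("at least two minority instances are needed")

    rng = random.Random(seed)            # seeded so that repeated runs agree
    per_instance = percentage // 100     # synthetic samples per minority instance
    k = min(k_neighbours, len(minority) - 1)

    synthetic = []
    for i, x in enumerate(minority):
        # Rank the other minority instances by Euclidean distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(per_instance):
            neighbour = rng.choice(neighbours)
            # random() draws from [0, 1); the text describes the open interval (0, 1).
            gap = rng.random()
            # New point = x + gap * (neighbour - x), computed feature by feature.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values, not from the dataset).
toy_minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_100 = smote_sketch(toy_minority, percentage=100, k_neighbours=2)
new_200 = smote_sketch(toy_minority, percentage=200, k_neighbours=2)
print(len(toy_minority) + len(new_100))  # minority count after 100% oversampling
print(len(toy_minority) + len(new_200))  # minority count after 200% oversampling
```

With four toy instances, 100% oversampling adds four synthetic points (doubling the class) and 200% adds eight (tripling it), which mirrors the 492 to 984 example in the text. Every synthetic point lies on the line segment between a minority instance and one of its neighbours.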
Table 1. SMOTe method call parameters [29]
Parameter            Value
Sampling Strategy    Auto
Random State         None
K Neighbours         5
M Neighbours         Deprecated
Out Step             Deprecated
Kind                 Deprecated
SVM Estimator        Deprecated
N Jobs               1
Ratio                None

During the experiment, different combinations of the parameters were investigated to find a combination that produced the ideal results. SMOTe represents an improvement over Random Oversampling, in which oversampling the minority class results in sub-optimal performance [30–31]. However, Douzas et al. in [32] stated that, in highly imbalanced datasets, too much Oversampling might result in overfitting. To combat this issue of oversampling, we used the data-point approach with SMOTe to interpolate the existing dataset to generate new instances. This approach aims at eliminating both between-class imbalances and within-class imbalances, hence avoiding the generation of random samples.

D. Experiment

The study used the Python programming language and Google Colab. Python offers succinct and human-readable code and a wide range of libraries and frameworks for implementing ML algorithms, which reduces development time and made it suitable for this study. The code was executed in a Google Colab notebook, which executes code on Google's cloud servers, leveraging the power of Google hardware, including Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), and is accessed from a browser. The first step was to import all the libraries. Once all the libraries were imported, the creditcard.csv dataset was uploaded. The dataset was validated to ensure that there were no null values or missing columns. An exploratory data analysis was performed to visualize and gain insight into the data. We then identified independent and dependent features. The dependent feature was stored separately in the Y variable and the independent features were stored in the X variable. The Y variable was the column that contained the indicator of whether the transaction was normal (labelled as 0) or fraud (labelled as 1), which was the variable we were trying to predict. The next step was to split the data into a training set and a testing set using the train-test-split function from the sklearn library. The train-test-split function accepts the independent variable X, dependent variable Y and test size. The test-size parameter specifies the ratio used to split the original dataset; 70% of the original dataset was used to train the model and 30% of the dataset was used to test the model. The next phase of the experiment was to build and train our model. We used each of the selected algorithms discussed in the classifications section of the research methodology. We fit each model with the x-train and y-train training data. We then used the x-test data to try to predict the y-test variable.

4. RESULTS AND DISCUSSION

A. Results

Once the experiment was concluded, we compared the y-test variable to the prediction results to generate a classification report. Precision measures the ability of a model to predict the positive class: Precision = TP / (TP + FP). Recall describes how good the model is at predicting the positive class when the actual outcome is positive: Recall = TP / (TP + FN). A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds. Askari in [33] stated that using both recall and precision is valuable to measure the predictive strengths of the model in situations where the distribution between two classes is imbalanced. The F1 score is the harmonic mean of precision and recall and serves as a measure of the test's accuracy; both the precision and the recall score are considered when calculating it. The initial results were achieved with an imbalanced dataset. The sensitivity performance metrics used to evaluate the imbalanced dataset results are:

The Precision score, calculated as follows:

\[ \text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{3} \]

The Recall score, calculated as follows:

\[ \text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{4} \]

The F1 score, calculated as follows:

\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{5} \]

The Average Precision (AP) is a score that is computed from the prediction score. AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight [34]:

\[ \sum_{n} (R_n - R_{n-1}) P_n = \text{AP} \tag{6} \]

The above formula computes the average precision, where P_n and R_n are the precision and recall at the nth threshold. Precision and recall are always between 0 and
1. Therefore, AP falls between 0 and 1. AP is a metric used to measure the accuracy of a classifier: the closer the number is to 1, the more accurate the classifier. To present the results, zero (0) was used to represent legitimate transactions and one (1) to represent fraudulent transactions. The lowest possible value is represented by 0.00 (0%) and the highest possible value is represented as 1.00 (100%). Table 2 uses ALG for algorithm, C for Class, P for Precision, R for Recall, F1 for F1-score, and AC for accuracy. Table 2 compares the scores of all four classifiers (SVM, LR, DT, and RF) before Oversampling, that is, the initial classification report for all the algorithms before the data-point level approach technique was applied on the credit card dataset.

which represent a classifier with poor performance. In our case, the SVM classifier performed worse than all other classifiers. There was high bias towards the majority class.
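As a minimal sketch of how the scores in a classification report (precision, recall, F1-score, and accuracy) follow from the confusion-matrix counts, the helper below applies the definitions given in the text to a pair of label lists. The function name `classification_metrics` and the toy labels are hypothetical and are not taken from the experiment; the zero-division fallbacks are an assumption of this sketch.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Fall back to 0.0 when a denominator is zero (assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy example: 95 legitimate (0) and 5 fraudulent (1) transactions.
# The hypothetical model finds only 2 of the 5 frauds and raises no false alarms.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 1, 0, 0, 0]
report = classification_metrics(y_true, y_pred)
print(report)
```

In this toy case accuracy is 0.97 while recall for the fraud class is only 0.40, which illustrates why accuracy alone can hide a bias towards the majority class on imbalanced data.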
Fig. 7 shows the precision-recall curve of the Random Forest classification, where the average precision computed was 0.48.

B. Oversampling with SMOTe

The next step of the experiment was to use SMOTe to resample the original dataset. We used the default values for most of the parameters, except for the random-state. The random-state parameter seeds the random number generator so that runs are reproducible; in the Random Forest classifier it controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node. After multiple iterations, the results presented were obtained using a random state of 42.

Table 3 shows high precision, recall, F1-score and accuracy for the decision tree and random forest. Fig. 9, 10, 11 and 12 plot the respective Precision-Recall curves of the classifications after Oversampling with SMOTe. The goal is to observe whether the P-R curve is towards the upper right corner of the chart to verify that the accuracy has improved. The closer the curve is to the value of one on the y-axis, the better the quality.
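The average precision values quoted for the precision-recall curves (for example, 0.48 for the Random Forest in Fig. 7) summarize each curve as a recall-weighted sum of precisions, AP = sum over n of (R_n - R_{n-1}) P_n. The following is a minimal sketch of that sum in plain Python, assuming binary labels (0 or 1) and one prediction score per transaction; the function name `average_precision` and the toy inputs are hypothetical and do not reproduce the paper's results.

```python
def average_precision(y_true, scores):
    """Sum (R_n - R_{n-1}) * P_n over thresholds, from the highest score down."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Indices ordered from the most to the least confident prediction.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    ap = 0.0
    true_positives = 0
    previous_recall = 0.0
    i = 0
    while i < len(order):
        # Samples sharing the same score are admitted together as one threshold.
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            true_positives += y_true[order[j]]
            j += 1
        precision = true_positives / j          # j samples predicted positive so far
        recall = true_positives / total_positives
        ap += (recall - previous_recall) * precision
        previous_recall = recall
        i = j
    return ap


# Toy scores: one legitimate transaction (score 0.4) outranks a fraud (score 0.35).
labels = [0, 0, 1, 1]
predicted_scores = [0.1, 0.4, 0.35, 0.8]
ap_value = average_precision(labels, predicted_scores)
print(round(ap_value, 4))
```

A ranking that places every fraud above every legitimate transaction gives an AP of 1.0, which corresponds to a curve that stays at the top of the chart; the misplaced pair in the toy data lowers the score.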
Future research can perform a cross validation or comparison across multiple datasets to verify the consistency of the data-point approach in handling imbalanced credit card fraud datasets. Further studies can investigate building and deploying a real-time solution that can detect fraud as and when the transaction is occurring.

ACKNOWLEDGMENT

KIND ACKNOWLEDGMENT TO THE DURBAN UNIVERSITY OF TECHNOLOGY FOR PROVIDING THE RESOURCES FOR THIS RESEARCH STUDY.

REFERENCES

[1] D. Robertson, The Nelson Report. October. Available online: http://www.nelsonreport.com/upload/content_promo/The_Nelson_Report_10-17-2016.pdf (accessed on 03 February 2019).
[2] D. Huang, D. Mu, L. Yang, and X. Cai, CoDetect: Financial Fraud Detection with Anomaly Feature Detection. National Natural Science Foundation of China, vol. 6, no. 2, pp. 19161-19174, 2018. DOI: 10.1109/ACCESS.2018.2816564
[3] C. Jiang, J. Song, G. Liu, L. Zheng, and W. Luan, Credit Card Fraud Detection: A Novel Approach Using Aggregation Strategy and Feedback Mechanism. IEEE Internet of Things Journal, vol. 5, no. 5, pp. 3637-3647, 2018. DOI: 10.1109/JIOT.2018.2816007
[4] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 25-36, 2006. DOI: https://doi.org/10.1007/s13369-016-2179-2
[5] A. O. Adewumi, and A. A. Akinyelu, A survey of machine-learning and nature-inspired based credit card fraud detection techniques. Int J Syst Assur Eng Manag, vol. 8, no. 2, pp. 937-953, 2017. https://doi.org/10.1007/s13198-016-0551-y
[6] Y. Bian, M. Cheng, C. Yang, Y. Yuan, Q. Li, and J. L. Zhao et al., Financial fraud detection: a new ensemble learning approach for imbalanced data. PACIS 2016 Proceedings, vol. 315, no. 1, pp. 1-11, 2016.
[7] A. T. Elhassan, M. Aljourf, F. Al-Mohanna, and M. Shoukri, Classification of Imbalance Data using Tomek Link (T-Link) Combined with Random Under-sampling (RUS) as a Data Reduction Method. Global J Technol Optim, vol. 1, no. 1, pp. 1-11, 2017. DOI: 10.4172/2229-8711.S1111
[8] K. Sotiris, K. Dimitris, and P. Panayiotis, Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 1-12, 2016.
[9] M. Zanin, M. Romance, S. Moral, and R. Criado, Credit card fraud detection through parenclitic network analysis. IEEE Access, vol. 1, no. 1, pp. 1-8, 2017. https://doi.org/10.1155/2018/5764370
[10] S. Patil, V. Nemade, and P. Kumar, Predictive Modelling for Credit Card Fraud Detection Using Data Analytics. Computational Intelligence and Data Science, vol. 132, no. 1, pp. 385–395, 2018. https://doi.org/10.1016/j.procs.2018.05.199
[11] M. Albashrawi, Detecting Financial Fraud Using Data Mining Techniques: A Decade Review from 2004 to 2015. Journal of Data Science, vol. 14, no. 1, pp. 553-570, 2016.
[12] R. A. Bauder, and T. M. Khoshgoftaar, The effects of varying class distribution on learner behavior for medicare fraud detection with imbalanced big data. Health Information Science and Systems, vol. 6, no. 9, pp. 1-14, 2018.
[13] K. Randhawa, C. K. Loo, M. Seera, C. P. Lim, and A. K. Nandi, Credit Card Fraud Detection Using AdaBoost and Majority Voting. IEEE Access, vol. 6, no. 1, pp. 14277-14284, 2018.
[14] A. Chouiekha, E. Hassane, and E. Haj, ConvNets for Fraud Detection analysis. Procedia Computer Science, vol. 127, no. 1, pp. 133–138, 2018. https://doi.org/10.1016/j.procs.2018.01.107
[15] G. Rekha, A. K. Tyagi, and R. V. Krishna, Solving Class Imbalance Problem Using Bagging, Boosting Techniques, with and Without Using Noise Filtering Method. International Journal of Hybrid Intelligent Systems, vol. 15, no. 2, pp. 67–76, 2019. DOI: 10.3233/HIS-190261
[16] R. A. Bauder and T. M. Khoshgoftaar, The effects of varying class distribution on learner behaviour for Medicare fraud detection with imbalanced big data. Health Information Science and Systems, vol. 6, no. 9, pp. 1-14, 2018. DOI: 10.1007/s13755-018-0051-3
[17] X. Liu, J. Wu, and Z. Zhou, Exploratory Undersampling for Class-Imbalance Learning. IEEE Transactions On Systems, Man, And Cybernetics, vol. 39, no. 2, pp. 539-550, 2009. DOI: 10.1.1.309.1465
[18] S. H. Ebenuwa, S. Sharif, and M. Alazab, Variance Ranking Attributes Selection Techniques for Binary Classification Problem in Imbalance Data. IEEE Access, vol. 7, no. 1, pp. 24649-24666, 2019.
[19] L. S. Feldt, A comparison of the precision of three experimental designs employing a concomitant variable. Psychometrika, vol. 23, no. 1, pp. 335-353, 1958. DOI: https://doi.org/10.1007/BF02289783
[20] E. Lejon, P. Kyosti, and J. Lindstrom, Machine learning for detection of anomalies in press-hardening: Selection of efficient methods. Process IT Innovations R&D Centre, vol. 1, no. 1, pp. 1079-1083, 2018. https://doi.org/10.1016/j.procir.2018.03.221
[21] G. Baader, and H. Krcmar, Reducing false positives in fraud detection: Combining the red flag approach with process mining. International Journal of Accounting Information Systems, vol. 31, no. 1, pp. 1–16, 2018. https://doi.org/10.1016/j.accinf.2018.03.004
[22] I. Sadgali, N. Sael, and F. Benabbou, Performance of machine learning techniques in the detection of financial frauds. Procedia Computer Science, vol. 148, no. 1, pp. 45-54, 2018. https://doi.org/10.1016/j.procs.2019.01.007
[23] K. Jiang, J. Lu, K. Xia, and L. Zheng, A Novel Algorithm for Imbalance Data Classification Based on Genetic Algorithm Improved SMOTE. Arab J Sci Eng, vol. 41, no. 1, pp. 3155-3266, 2016. DOI: https://doi.org/10.1007/s13369-016-2179-2
[24] N. Malini, and M. Pushpa, Analysis on Credit Card Fraud Identification Techniques based on KNN and Outlier Detection. Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), vol. 1, no. 1, pp. 1-12, 2017. DOI: 10.1109/AEEICB.2017.7972424
[25] E. M. Hassib, A. I. El-Desouky, E. M. El-Kenawy, and S. M. Ghamrawy, Imbalanced Big Data Mining Framework for Improving Optimization Algorithms Performance. IEEE Access, vol. 7, no. 1, pp. 170774-170795, 2019.
[26] A. Somasundaran, and U. S. Reddy, Data Imbalance: Effects and Solutions for Classification of Large and Highly Imbalanced Data. Proc. of 1st International Conference on Research in Engineering, Computers and Technology, vol. 25, no. 10, pp. 28-34, 2016.
[27] A. D. Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi, Credit Card Fraud Detection: A Realistic Modelling and a Novel Learning Strategy. Transactions on neural networks and learning systems, vol. 29, no. 8, pp. 3784-3797, 2018. DOI: 10.1109/TNNLS.2017.2736643
[28] A. Mubalik, and E. Adali, Multilayer Perception Neural network technique for fraud detection. Computer Science and Engineering (UBMK), vol. 1, no. 1, pp. 383-387, 2017.