This document summarizes a research article that compares and analyzes logistic regression, Naive Bayes, and KNN machine learning algorithms for credit card fraud detection. The research implements these three algorithms in Python and measures their performance based on accuracy, sensitivity, specificity, precision, F-measure, and AUC. Logistic regression was found to perform better than the other models, particularly when applied to data preprocessed using random under-sampling techniques for imbalanced datasets.


Comparison and analysis of logistic regression, Naïve Bayes and KNN machine
learning algorithms for credit card fraud detection

Article  in  International Journal of Information Technology · February 2020


DOI: 10.1007/s41870-020-00430-y


3 authors, including:

Meenakshi Mittal Satwinder Singh


Panjab University Central University of Punjab

All content following this page was uploaded by Satwinder Singh on 12 February 2022.



Int. j. inf. tecnol.
https://doi.org/10.1007/s41870-020-00430-y

ORIGINAL RESEARCH

Comparison and analysis of logistic regression, Naïve Bayes and KNN machine learning algorithms for credit card fraud detection

Fayaz Itoo¹ • Meenakshi² • Satwinder Singh²

Received: 5 November 2019 / Accepted: 22 January 2020
© Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020

✉ Satwinder Singh
[email protected]

¹ Central University of Punjab, Bathinda, India
² A.P. Department of Computer Science and Technology, Central University of Punjab, Bathinda, India

Abstract Financial fraud is a rapidly growing threat with a severe impact on the economy, collaborating institutions and administration. Credit card transactions are increasing quickly because advances in internet technology have led to a high dependence on the internet. With the upgrading of technology and the growing use of credit cards, fraud rates have become a challenge for the economy. As new security features are added to credit card transactions, fraudsters develop new patterns or find loopholes to exploit them. As a result, the behaviour of fraudulent and normal transactions changes constantly. A further problem is that credit card data is highly skewed, which leads to inefficient prediction of fraudulent transactions. To achieve better results, imbalanced or skewed data is pre-processed with a re-sampling (over-sampling or under-sampling) technique. Three different proportions of the dataset were used in this study, and the random under-sampling technique was applied to the skewed dataset. This work uses three machine learning algorithms: logistic regression, Naïve Bayes and K-nearest neighbour. The performance of these algorithms is recorded together with a comparative analysis. The work is implemented in Python, and the performance of the algorithms is measured by accuracy, sensitivity, specificity, precision, F-measure and area under curve. On the basis of these measurements, the logistic regression based model for predicting fraudulent transactions was found to be better than the prediction models developed from Naïve Bayes and K-nearest neighbour. Better results are also seen when under-sampling techniques are applied to the data before the prediction model is developed.

Keywords Credit card fraud · Fraud detection · Random under-sampling · Logistic regression · Naïve Bayes · KNN

1 Introduction

Fraud is typically defined as criminal deception with the purpose of obtaining gain. With the fast-growing dependence on internet technology, the rate at which credit card fraud occurs has increased alarmingly. Almost all modes of transaction, online or offline, are made via credit cards. Most research work concentrates on the detection of external credit card fraud. There are two types of credit card fraud: inner card fraud and external card fraud. Inner card fraud takes place when a false identity is used to commit fraud through collusion between cardholders and the bank; external fraud, on the other hand, consists of using the credit card to obtain cash through dubious means [1]. Credit card fraud is a critical issue that costs banking organisations and card-issuing companies a great deal. Because of this serious weakness in the transaction system, banking organisations treat credit card fraud as a grave issue; to curb the menace they run fully-fledged security systems that monitor transactions and spot fraud as quickly as possible once it is committed. Fraud detection is necessary to limit the impact of dubious transactions on service delivery, costs and company reputation. The use of machine learning has been helpful

in solving a variety of essential business problems such as detecting email spam, targeted product recommendation and accurate diagnosis. The rise of machine learning has been attributed to increasing processing power, the availability of large amounts of data and improvements in statistical modelling [2]. Fraud management is the most difficult task for banks and the commerce industry. The number of transactions has multiplied because of the large number of payment channels: credit/debit cards, smartphones and kiosks. At the same time, criminals have become adept at finding loopholes. Hence it is very hard for businesses to authenticate transactions [3].

Data researchers have been quite successful in addressing this problem with machine learning and predictive analytics. The difficulty with credit card fraud detection is that the data is skewed or unbalanced: the algorithms treat the minority category as noise and predict only the majority category accurately, not the minority category [4, 5]. For skewed or imbalanced data, various resampling techniques can be applied and can produce better results. The imbalance problem can also be addressed by an ensemble learning framework, which preserves the integrity of the sample features and relies on splitting and aggregating the training set [6]. To reduce false alarm rates and increase the efficiency of credit card fraud detection, various approaches such as outlier detection methods have been used [7]. These approaches can optimise the solution, and their implementation in a bank credit card fraud detection system (CCFDS) is very useful in detecting and preventing fraudulent transactions [7].

2 Related work

Credit card fraud is on the rise as the number of online transactions increases. To detect credit card fraud and prevent fraudulent transactions, effective methods are needed that act before the transactions cause a huge loss to banks and credit card holders. Researchers have proposed various fraud detection methods that are effective to some extent, but the real problems are the limited availability of datasets, due to security issues, and the fact that the datasets are very imbalanced. Some of the methods proposed for credit card fraud detection are neural networks, fusion of Dempster-Shafer theory and Bayesian learning, hidden Markov models, fuzzy Darwinian systems, outlier detection methods, support vector machines, genetic algorithms, covering algorithms, meta-classifiers, data mining, ensemble learning and machine learning algorithms (random forest, decision trees, support vector machines, Bayesian networks, MLP, Naïve Bayes and many more). A review of some of the research papers on credit card fraud detection and prevention follows.

Padvekar et al. [8] demonstrated that credit card fraud can be detected during transactions using a hidden Markov model (HMM). The HMM achieves high fraud coverage combined with a low false alarm rate. They used ranges of transaction amount as the observation symbols, while the types of item are considered to be the states of the HMM. They proposed a method for finding the spending profile of cardholders, and for using this knowledge to decide the values of the observation symbols and to estimate the model parameters. It is also explained how the HMM can detect whether an incoming transaction is fraudulent or not. Comparative studies show that the accuracy of the system is close to 80% over a wide variation in the input data. The system is also scalable for handling large volumes of transactions.

Khare and Sait [9] examined and compared the performance of Decision Tree, Random Forest, SVM and Logistic Regression classifiers. The methods were applied to both raw and pre-processed data. The experiments showed that logistic regression has an accuracy of 97.7%, SVM 97.5% and decision tree 95.5%, while the best results are obtained by random forest with an accuracy of 98.6%. They therefore conclude that random forest gives the highest accuracy, 98.6%, for credit card fraud detection on the dataset provided by ULB.

Banerjee et al. [10] examined various machine learning classifiers trained on a public dataset to analyse the correlation of certain factors with fraudulence. Better metrics are used to determine false negative rates, and the performance of random sampling was measured to deal with the class imbalance of the dataset. The support vector machine performed better for detecting credit card fraud under realistic conditions. Deep learning and regression models are compared to determine which algorithm and combination of factors provides the most accurate method of classifying a credit card transaction as fraudulent or non-fraudulent. The best algorithm for analysing datasets with a close to 1:1 ratio of fraudulent to non-fraudulent transactions is the Random Forest classifier, assuming the fraud to non-fraud distribution of the testing and training sets is the same.

Mishra and Ghorpade [4] analysed various classification techniques using various metrics for evaluating various

classifiers. The models were trained using various classification and ensembling techniques. The models used were logistic regression, decision tree, random forest, support vector machines and various ensembling models. These models were trained and the results were obtained. Results obtained on the actual dataset were also good, and with a recall of about 96% the Random Forest classifier performed better than the other classifiers.

Xuan et al. [11] used two kinds of random forest to learn the behaviour features of normal and abnormal transactions. The two random forests, which differ in their base classifiers, are compared, and their performance on credit card fraud detection is analysed. The two algorithms are random-tree-based random forest and CART-based random forest, whose training sets come from bootstrapped samples. Three experiments were performed for the two algorithms on different datasets with different proportions. The performance of the algorithms was measured for all three experiments, and the additional metrics used were the intervention rate of transactions and the average rate of the model. The CART-based random forest performed better in all the experiments.

3 Methodology

In this research work the machine learning classifiers logistic regression (LR), K-nearest neighbour (KNN) and Naïve Bayes (NB) are applied, with Python as the implementation language. The experiments are evaluated using the confusion matrix, and the performance of the algorithms is compared using accuracy, sensitivity, specificity, precision, F-measure and area under curve (AUC).

Creating and processing the classifiers involves several stages: gathering of data, pre-processing of data, training of algorithms, testing of algorithms and analysis of classifiers.

During pre-processing, the data is transformed into a usable format and then sampled. Random under-sampling is applied because the dataset is highly imbalanced and biased towards the negative cases (non-fraud cases); under-sampling yields three data distributions. The features selected are the principal components produced by principal component analysis (PCA) dimensionality reduction, giving 28 principal components denoted V1, V2, …, V28. During the training stage the algorithms are given the processed data as input. Testing datasets are assessed using the trained models of the classifiers. The step-by-step methodology is explained as follows. Figure 1 shows the flow diagram of the research work.

Fig. 1 Flow diagram of research work

3.1 Collection of dataset and pre-processing

The dataset is acquired from Kaggle, which hosts the credit card fraud detection dataset [12]. The dataset consists of MasterCard transactions made by European cardholders in September 2013. The transactions that occurred over 2 days were recorded, amounting to 284,807 entries. The positive category (fraud cases) makes up 0.172% of the transactions. The features are transformed into numerical input values and reduced to 28 principal components by applying PCA. These principal components are named V1, V2, V3, … and V28. The features include credit limit, gender, marital status, previous months' bills, previous months' payments, status of existing account, salary assignments, credit history, other existing credits, purpose, credit amount, present employment, savings account, personal status, other debtors, property, age in months, housing, number of existing credits, job, telephone, foreign worker, ID, credit card number, PIN, Time, Amount and Class. From the statistics of total entries and fraud cases it can be inferred that the dataset is very unbalanced.
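The degree of imbalance described in Section 3.1 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it uses the 284,807 entries quoted above and the 492 fraud cases listed in Tables 1, 2 and 3, and the variable names are chosen here rather than taken from the authors' code.

```python
# Counts quoted in the paper: total entries (Section 3.1), fraud cases (Tables 1-3).
total_transactions = 284_807
fraud_cases = 492

non_fraud_cases = total_transactions - fraud_cases
fraud_share = fraud_cases / total_transactions

print(f"non-fraud cases: {non_fraud_cases}")
print(f"fraud share of all transactions: {fraud_share:.3%}")
print(f"non-fraud transactions per fraud case: {non_fraud_cases / fraud_cases:.0f}")
```

The fraud share comes out at roughly 0.17%, in line with the figure reported in the text, which is why a classifier that always predicts "non-fraud" would already look highly accurate on the raw data.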

The dataset is inclined towards the negative class. The background details of the features are hidden and cannot be shown due to privacy issues. The 'Time' feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The 'Amount' feature is the transaction amount. The 'Class' feature indicates whether the transaction is fraud or non-fraud: a class value of 1 represents a fraud transaction and a class value of 0 represents a non-fraud transaction.

3.2 Under-sampling of dataset

To cope with unbalanced datasets, either the classification algorithms must be modified or the classes in the training data must be balanced; the latter is known as data pre-processing. Pre-processing is widely used, and the pre-processed data is given as input to the machine learning algorithm. The chief aim of data pre-processing is to increase the number of minority-class instances or to decrease the number of majority-class instances, so that both classes have the same number of instances.

Random under-sampling (RUS) is a widely used resampling method and is therefore used in our study. RUS is chosen for its simplicity and effectiveness. The aim of RUS is to balance the class distribution by randomly eliminating majority-class examples. This is done until the majority and minority class instances are balanced. RUS improves run time and eases storage problems by reducing the number of training samples in large datasets. The only limitation of RUS is the loss of some potentially important information. In this study the dataset is distributed in three (fraud:non-fraud) proportions: 50:50, 34:66 and 25:75. Results are reported for the random under-sampling method for all three data distributions (Fig. 2).

Fig. 2 Random under-sampling working of the dataset

3.3 Classification techniques

The credit card dataset has two values for the classification of transactions, so this is a binary classification problem in which transactions are classified either as fraud (1) or non-fraud (0). After the data has been resampled by under-sampling, the classifiers are trained using the training data to evaluate the methods. The classification techniques used in this study are logistic regression (LR), Naïve Bayes (NB) and K-nearest neighbour (KNN).

3.4 Dataset division

The credit card dataset is split into two parts: a training set and a testing set. In this study we chose the ratios 50:50, 34:66 and 25:75 (fraud:non-fraud). Figure 3 shows an overview of the dataset division, in which the data is split into training and testing sets and resampling (random under-sampling) is applied. Tables 1, 2 and 3 give more details about the dataset division.

In Table 1 the dataset is divided in the ratio 50:50, which means that the same number of fraud and non-fraud instances is used to train the three classification techniques.

In Table 2 the dataset (fraud and non-fraud instances) is divided in the ratio 34:66; these numbers of instances are used to train the three classification techniques.

In Table 3 the dataset (fraud and non-fraud instances) is divided in the ratio 25:75; these numbers of instances are used to train the three classification techniques.

Fig. 3 Division of dataset
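The random under-sampling step of Section 3.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `random_undersample`, its parameters and the toy label list are hypothetical, and only the Python standard library is used. All fraud rows are kept, and majority rows are drawn at random until the requested fraud:non-fraud proportion is reached.

```python
import random

def random_undersample(labels, fraud_parts, non_fraud_parts, seed=0):
    """Return sorted row indices keeping every fraud row (label 1) and a random
    subset of non-fraud rows (label 0) sized to the fraud:non-fraud proportion."""
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    non_fraud_idx = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority rows needed for the requested proportion.
    n_keep = round(len(fraud_idx) * non_fraud_parts / fraud_parts)
    n_keep = min(n_keep, len(non_fraud_idx))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    kept = rng.sample(non_fraud_idx, n_keep)
    return sorted(fraud_idx + kept)

# Toy labels (hypothetical data, not the Kaggle file): 10 fraud rows among 1,000.
labels = [1] * 10 + [0] * 990
for fraud_parts, non_fraud_parts in [(50, 50), (34, 66), (25, 75)]:
    idx = random_undersample(labels, fraud_parts, non_fraud_parts)
    n_fraud = sum(labels[i] for i in idx)
    print(f"{fraud_parts}:{non_fraud_parts} -> fraud={n_fraud}, non-fraud={len(idx) - n_fraud}")
```

Because the discarded majority rows are chosen at random, different seeds give different training sets; this is the information loss that Section 3.2 names as the limitation of the method.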

Table 1 Division of dataset by ratio 50:50

Data division   Training data   Resampling method (RUS)
Fraud           492             344
Non-fraud       284,315         344
Total           284,807         688

Table 2 Division of dataset by ratio 34:66

Data division   Training data   Resampling method (RUS)
Fraud           492             341
Non-fraud       284,315         692
Total           284,807         1033

Table 3 Division of dataset by ratio 25:75

Data division   Training data   Resampling method (RUS)
Fraud           492             353
Non-fraud       284,315         1024
Total           284,807         1377

Since 30% of the dataset is used for testing the models, the number of fraud and non-fraud cases in the testing data after resampling (random under-sampling) is given in Table 4 for the three proportions.

Table 4 Preparation of testing dataset

Data proportion   Fraud   Non-fraud   Total
50:50             35      261         296
34:66             137     306         443
25:75             141     450         591

From here on we use A = 50:50, B = 34:66 and C = 25:75.

After the training and testing datasets are selected, the three classification techniques LR, NB and KNN are trained using the training dataset, giving three corresponding models. The testing dataset is then evaluated with these three models and the performance is assessed.

3.5 Performance evaluation

Performance evaluations were done for the three classification techniques LR, NB and KNN with the resampling technique (RUS) used. The four elementary counts on which performance evaluation is based are: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). True positives are cases predicted as positive that are in reality positive. True negatives are cases correctly predicted as negative. False positives are cases predicted as positive that are actually negative. False negatives are cases predicted as negative that are actually positive. The relationship between these counts is given in a confusion matrix. Additionally, the performance of the three algorithms is compared in terms of sensitivity, specificity, accuracy, precision, F-measure and area under curve (AUC). The metrics are calculated from the confusion matrix shown in Table 5.

Table 5 Confusion matrix of credit card dataset

                   Predicted fraud   Predicted non-fraud
Actual fraud       TP                FN
Actual non-fraud   FP                TN

Accuracy: Accuracy is defined as the proportion of all predicted transactions that are correct [13]

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (1)$$

Sensitivity: The proportion of observed positive values correctly predicted as positive. It is also called the True Positive Rate (TPR) [13]

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \qquad (2)$$

Specificity: Specificity measures how accurately the negative (legitimate) cases are classified; in our case it gives the accuracy of predicting legitimate transactions. It is also called the True Negative Rate (TNR) [13]

$$\text{Specificity} = \frac{TN}{FP + TN} \qquad (3)$$

Precision: The proportion of positive (fraud) predictions that are actually correct [13].

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

F-measure: The F-measure gives the accuracy of the test, that is, of the experiments performed. It uses both precision and recall to compute its value.
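The confusion matrix of Table 5 and Eqs. (1) to (4) can be sketched in plain Python as below. The helper names `confusion_counts` and `basic_metrics` and the ten toy predictions are illustrative assumptions, not taken from the paper; fraud (label 1) is treated as the positive class, as in Section 3.3.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN with fraud (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

def basic_metrics(tp, fn, fp, tn):
    """Eqs. (1)-(4); assumes every denominator is non-zero."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Toy example (hypothetical predictions): 4 fraud and 6 non-fraud transactions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
for name, value in basic_metrics(tp, fn, fp, tn).items():
    print(f"{name}: {value:.3f}")
```

In this toy case one fraud is missed and one legitimate transaction is flagged, so sensitivity and precision are both 0.75 while accuracy is 0.8, which shows how accuracy alone can hide errors on the minority class.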

The best value of the F-measure (F1 score) is 1 [14].

$$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$

Area under curve (AUC): AUC represents the degree or measure of separability, that is, how well the model is capable of distinguishing between the classes [14].

$$\text{AUC} = \frac{1}{2} \cdot (\text{Sensitivity} + \text{Specificity}) \qquad (6)$$

4 Result and discussion

This part deals with the results gathered during the experiments. Tables 6, 7 and 8 show the comparison results of the three classifiers with the resampling technique used, for the ratios 50:50, 34:66 and 25:75 respectively. The parameters chosen for comparison are sensitivity, specificity, accuracy, precision, F-measure and AUC. From the results it is clear that in all three proportions logistic regression (LR) dominates, with higher accuracy for the random under-sampling method used.

(i) Comparison of classification techniques by ratio A

The value of the parameters for the three classifiers for the ratio A is given in Table 6.

(ii) Comparison of classification techniques by ratio B

The value of the parameters for the three classifiers for the ratio B is given in Table 7.

(iii) Comparison of classification techniques by ratio C

The value of the parameters for the three classifiers for the ratio C is given in Table 8.

The sensitivity comparison for each proportion is shown in Fig. 4, which shows a better result for logistic regression than for Naïve Bayes and KNN.

The specificity comparison for each proportion is shown graphically in Fig. 5, which shows similar results for logistic regression and Naïve Bayes for each of the under-sampling ratios A, B and C.

As shown in Figs. 6, 7 and 8, logistic regression gives the better result in all three proportions. It gives the highest accuracy in all three proportions compared with Naïve Bayes and KNN (Fig. 6), whereas KNN showed the lowest accuracy in all three proportions compared with the other two algorithms (Fig. 6). In the 25:75 ratio, represented by 'C', all the algorithms performed better in terms of accuracy, so this split is considered the best for further training and testing. The same pattern, with better performance of logistic regression, can be seen for the other performance parameters (precision and F-measure) in Figs. 7 and 8 respectively.

4.1 Analysis of algorithms for the three proportions

The values of the parameters for the three proportions are given in Tables 6, 7 and 8. Logistic regression shows higher values for most of the parameters because it maximizes the conditional data likelihood function. The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their respective X values. The second reason is that logistic regression allows the feature values to be dependent, and the correlation between the features contributes to the prediction of a new data point. Logistic regression also shows the highest accuracy (91.2%, 92.3% and 95.9% for ratios A, B and C respectively) for a medium-sized dataset, and it is able to estimate the patterns of the fraud data in the balanced dataset. Also, in this study data balancing was done by under-sampling. These might be the reasons for the better performance of logistic regression on each parameter (sensitivity, specificity, precision, F-measure and AUC). The decision boundary is set by maximising the conditional data likelihood function.

As seen in Tables 6, 7 and 8, the Naïve Bayes algorithm shows lower accuracy than logistic regression. This might be because Naïve Bayes assumes that the features are independent of each other, so each feature contributes individually to the prediction of a new data point. Secondly, the assumption that the features are not correlated dramatically decreases the number of parameters that must be estimated to learn the classifier, and this might be why the algorithm sometimes shows lower values for sensitivity, specificity, accuracy, precision, F-measure and AUC than logistic regression. For example, the sensitivity of Naïve Bayes is 0.757, 0.718 and 0.664, compared with 0.878, 0.777 and 0.839 for logistic regression.

Table 6 Comparison of classification techniques by ratio A (resampling method: random under-sampling)

Techniques            Sensitivity  Specificity  Accuracy  Precision  F-measure  AUC
Logistic regression   0.878        0.949        0.912     0.951      0.913      0.914
Naïve Bayes           0.757        0.964        0.854     0.959      0.846      0.860
K-nearest neighbour   0.687        0.669        0.679     0.701      0.694      0.678
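Equations (5) and (6) can be checked against the rows of Table 6. The sketch below recomputes the F-measure and AUC from the reported sensitivity, specificity and precision; the function names are chosen for illustration, and the name `auc_single_threshold` reflects the observation that Eq. (6) is the mean of sensitivity and specificity at one decision threshold (the quantity often called balanced accuracy), not an integral over a full ROC curve.

```python
def f_measure(precision, recall):
    """Eq. (5): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def auc_single_threshold(sensitivity, specificity):
    """Eq. (6): mean of sensitivity and specificity."""
    return 0.5 * (sensitivity + specificity)

# Rows of Table 6 (ratio A): sensitivity, specificity, precision, F-measure, AUC.
table6 = {
    "Logistic regression": (0.878, 0.949, 0.951, 0.913, 0.914),
    "Naive Bayes": (0.757, 0.964, 0.959, 0.846, 0.860),
    "K-nearest neighbour": (0.687, 0.669, 0.701, 0.694, 0.678),
}

for name, (sens, spec, prec, f_reported, auc_reported) in table6.items():
    f_calc = f_measure(prec, sens)  # recall is the same quantity as sensitivity
    auc_calc = auc_single_threshold(sens, spec)
    print(f"{name}: F={f_calc:.4f} (reported {f_reported}), "
          f"AUC={auc_calc:.4f} (reported {auc_reported})")
```

The recomputed values agree with the reported F-measure and AUC columns to within rounding, which supports reading the two columns as derived from Eqs. (5) and (6).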

Table 7 Comparison of classification techniques by ratio B (resampling method: random under-sampling)

Techniques            Sensitivity  Specificity  Accuracy  Precision  F-measure  AUC
Logistic regression   0.777        1.0          0.923     1.0        0.875      0.888
Naïve Bayes           0.718        1.0          0.902     1.0        0.836      0.859
K-nearest neighbour   0.477        0.789        0.681     0.544      0.508      0.633

Table 8 Comparison of classification techniques by ratio C (resampling method: random under-sampling)

Techniques            Sensitivity  Specificity  Accuracy  Precision  F-measure  AUC
Logistic regression   0.839        0.997        0.959     0.991      0.909      0.918
Naïve Bayes           0.664        0.995        0.915     0.979      0.789      0.829
K-nearest neighbour   0.405        0.861        0.751     0.483      0.441      0.633

Fig. 4 Sensitivity (bar chart: logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

Fig. 5 Specificity (bar chart: logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

Fig. 6 Accuracy (bar chart: logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

Fig. 7 Precision (bar chart: logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)
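Section 4.1 attributes the results of logistic regression to maximising the conditional data likelihood, that is, the probability of the observed Y values given their X values. A minimal sketch of that quantity (in log form) is given below; the weights, bias and four toy rows are hypothetical values chosen for illustration and are unrelated to the paper's fitted models.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def conditional_log_likelihood(weights, bias, X, y):
    """Sum over rows of log P(y_i | x_i) under a logistic regression model."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p_fraud = sigmoid(bias + sum(w * v for w, v in zip(weights, x_i)))
        total += math.log(p_fraud) if y_i == 1 else math.log(1.0 - p_fraud)
    return total

# Toy data: two features per row, label 1 = fraud, 0 = non-fraud.
X = [[2.0, 1.0], [1.5, 0.5], [-1.0, -0.5], [-2.0, -1.5]]
y = [1, 1, 0, 0]

ll_uninformed = conditional_log_likelihood([0.0, 0.0], 0.0, X, y)  # every P = 0.5
ll_separating = conditional_log_likelihood([1.0, 1.0], 0.0, X, y)
print(f"all-zero weights:   {ll_uninformed:.4f}")
print(f"separating weights: {ll_separating:.4f}")
```

Training searches for the weights that make this sum as large as possible; in the toy case the weights that separate the two classes score higher than the all-zero weights, and the resulting decision boundary is where the predicted probability equals 0.5.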

F-Measure respectively. Table 6 clearly shows better performance for


1 all the metrics (else than Accuracy and Precision) as
0.9 compared to other proportions shown in Tables 7 and 8.
0.8 The KNN showed the accuracy of 75% for 50:50 propor-
0.7 Logistic Regression tion which is better than the other proportions as it dif-
0.6 ferentiates the classes more accurately when the training
0.5 Naïve Bayes data increases. This is clear that when the training data is
0.4 increased the algorithms are performing better and this
0.3 K-Nearest shows that the data proportion of 25:75 is better for training
Neighbour
0.2 the classifiers. For all the proportions taken LR performs
0.1 very well with at most 95% correctness. The accuracy of
0 all the classifiers for all the three data proportions is shown
A B C
in the Fig. 6.
Fig. 8 F-measure

for respectively three different ratios of 50:50, 34:66 and 5 Conclusion and future work
25:75. Also Naı̈ve Bayes algorithm has a greater bias but
lower variance than logistic regression which might be The research work was carried out with the purpose of
support the under sampling methodology of data balancing. comparing the ability of machine learning algorithms as to
K-nearest neighbour requires a distance or separation measure defined between two data instances. KNN classifies an incoming transaction by computing the distance from the new transaction to its closest points. If the nearest neighbour is fraudulent, the transaction is flagged as fraud. The value of K is usually chosen to be small and odd (typically 1, 3 or 5) so that ties are avoided. Larger values of K can reduce the effect of noise in the dataset. In this algorithm, the distance between two data instances is computed using the Euclidean distance. For multivariate data, the distance is typically computed for each attribute and then combined. The algorithm shows poor accuracy for proportion C (50:50), as the bar graphs in Figs. 4, 5, 6, 7 and 8 show. This is due to the small sample of training data: there is much more similarity between the fraud and non-fraud cases, so the algorithm does not efficiently differentiate the patterns in fraud and non-fraud cases.

The experimental results make clear that the accuracy of the classifiers increases as the training data is increased. Table 6 summarises the information for the ratio 34:66 for Random Under-sampling (RUS). In this proportion, logistic regression and Naïve Bayes have the same specificity rate of 1.0, which means that both classifiers classified the negative (non-fraud) cases with 100% accuracy. A possible reason is that the accuracy of both classifiers increases as the training sample grows. Both algorithms estimate class probabilities from the training data, and these estimates become more reliable as the number of training samples grows, which helps classify the data samples more accurately. This is reflected in the AUC values, which are 0.89 and 0.85 for logistic regression and Naïve Bayes

measure how accurately they differentiate and classify the fraud and non-fraud transactions of the credit card dataset with the random under-sampling (RUS) method, and to check whether the performance improves. Logistic Regression (LR) showed the best performance for all the data proportions compared with Naïve Bayes (NB) and K-Nearest Neighbour (KNN). LR achieved higher accuracy than Naïve Bayes and KNN: LR showed a maximum accuracy of 95%, NB 91% and KNN 75%. LR also shows better Sensitivity, Specificity, Precision and F-Measure than NB and KNN. It was also observed that LR and Naïve Bayes, which fit a model to the labelled training data, gave better results in each case than KNN, which classifies directly from the labelled neighbours without fitting a model.

Other resampling methods could also be applied to the skewed dataset for credit card fraud detection (CCFD), and the resampling methods could be improved to get better results. Our results could also be compared with other techniques such as Random Forest, SVC, Decision Trees, Neural Networks and Genetic Algorithms. The main limitation of Random Under-sampling is that some information could be lost, so new resampling methods could be devised to achieve better results, which may prove helpful for credit card fraud detection (CCFD) in future. Likewise, our results might be useful in helping organisations build a much improved credit card fraud detection system (CCFDS) that deals better with skewed data and uses more suitable measures to assess the outcomes.
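The steps discussed in this section (random under-sampling to a chosen fraud to non-fraud proportion, KNN with Euclidean distance and a small odd K, and the evaluation measures) can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the authors' implementation: the function names `random_under_sample`, `knn_predict` and `evaluate`, the `fraud_share` parameter and the toy data are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def random_under_sample(rows, labels, fraud_share, seed=0):
    """Keep every fraud row (label 1) and a random subset of non-fraud rows
    (label 0) so that fraud makes up roughly `fraud_share` of the result."""
    fraud = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    n_legit = round(len(fraud) * (1 - fraud_share) / fraud_share)
    rng = random.Random(seed)
    # The discarded non-fraud rows are the information that RUS loses.
    kept = fraud + rng.sample(legit, min(n_legit, len(legit)))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest in Euclidean distance."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def evaluate(y_true, y_pred):
    """Evaluation measures with fraud (1) treated as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy, made-up data: 4 fraud rows near (5, 5) and 20 non-fraud rows near (0, 0).
rng = random.Random(1)
rows = [(5 + rng.random(), 5 + rng.random()) for _ in range(4)]
rows += [(rng.random(), rng.random()) for _ in range(20)]
labels = [1] * 4 + [0] * 20

# 50:50 proportion: all 4 fraud rows plus 4 randomly kept non-fraud rows.
bal_rows, bal_labels = random_under_sample(rows, labels, fraud_share=0.5)

# Scoring the original rows only exercises the code; a real experiment
# would evaluate on a held-out test set.
preds = [knn_predict(bal_rows, bal_labels, r, k=3) for r in rows]
metrics = evaluate(labels, preds)
print(metrics)
```

Passing `fraud_share=0.34` instead would approximate the 34:66 proportion discussed for Table 6. Because the under-sampling step throws away most non-fraud rows, repeating the run with different `seed` values is one simple way to see how much the measures depend on which rows happened to be kept.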


References Proceedings of the 3rd IEEE international conference on advan-


ces in electrical and electronics, information, communication and
1. Kundu A, Panigrahi S, Sural S, Majumdar AK (2009) BLAST- bio-informatics, AEEICB 2017, pp 255–258
SSAHA hybridization for credit card fraud detection. IEEE Trans 8. Padvekar SA, Kangane PM, Jadhav KV (2016) Credit card fraud
detection system. Int J Eng Comput Sci 5(4):16183–16186
Dependable Secure Comput 6(4):309–315
2. Guo T, Li G-Y (2008) Neural data mining for credit card fraud 9. Khare N, Sait SY (2018) Credit card fraud detection using
detection. In: International conference on machine learning and machine learning models and collating machine learning models.
cybernetics Int J Pure Appl Math 118(20):825–838
10. Banerjee R, Bourla G, Chen S, Kashyap M, Purohit S, Battipaglia
3. Ghobadi F, Rohani M (2016) Cost sensitive modeling of credit
card fraud using neural network strategy. In: International con- J (2018) Comparative analysis of machine learning algorithms
ference of signal processing and intelligent systems (ICSPIS) through credit card fraud detection. New Jersey’s Governor’s
4. Mishra A, Ghorpade C (2018) Credit card fraud detection on the School of Engineering and Technology, Piscataway, pp 1–10
skewed data using various classification and ensemble tech- 11. Xuan S, Liu G, Li Z, Zheng L, Wang S, Jiang C (2018) Random
niques. In: 2018 IEEE International students’ conference on forest for credit card fraud detection. In: ICNSC 2018—15th
electrical, electronics and computer science, SCEECS 2018 IEEE International conference on networking, sensing and con-
5. Raj SE, Portia AA (2011) Analysis on credit card fraud detection trol, pp 1–6
methods. In: International conference on computer, communica- 12. Hordri NF, Yuhaniz SS, Firdaus N, Azmi M, Shamsuddin SM
tion and electrical technology (2018) Handling class imbalance in credit card fraud using
6. Wang H, Zhu P, Zou X, Qin S (2018) An ensemble learning resampling methods. Int J Adv Comput Sci Appl 9(11):390–396
13. Awoyemi JO, Adetunmbi AO, Oluwadare SA (2017) Credit card
framework for credit card fraud detection based on training set
partitioning and clustering. In: 2018 IEEE SmartWorld, Ubiquitous fraud detection using machine learning techniques: a comparative
Intelligence & Computing, Advanced & Trusted Computing, Scal- analysis. In: Proceedings of the IEEE international conference on
able Computing & Communications, Cloud & Big Data Computing, computing, networking and informatics, ICCNI 2017, vol 2017–
Internet of People and Smart City Innovation (SmartWorld/SCAL- Jan, pp 1–9
COM/UIC/ATC/CBDCom/IOP/SCI), Guangzhou, pp 94–98. 14. Hordri NF, Yuhaniz SS, Azmi NFM, Shamsuddin SM (2018)
https://doi.org/10.1109/SmartWorld.2018.00051 Handling class imbalance in credit card fraud using resampling
7. Malini N, Pushpa M (2017) Analysis on credit card fraud iden- methods. Int J Adv Comput Sci Appl 9(11):390–396
tification techniques based on KNN and outlier detection. In:
