

Comparison and analysis of logistic regression, Naïve Bayes and KNN machine
learning algorithms for credit card fraud detection

Article in International Journal of Information Technology · February 2020

DOI: 10.1007/s41870-020-00430-y


Int. j. inf. tecnol.
https://doi.org/10.1007/s41870-020-00430-y

ORIGINAL RESEARCH

Fayaz Itoo (1) · Meenakshi (2) · Satwinder Singh (2)

Received: 5 November 2019 / Accepted: 22 January 2020

Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020

(1) Central University of Punjab, Bathinda, India
(2) A.P. Department of Computer Science and Technology, Central University of Punjab, Bathinda, India
Correspondence: Satwinder Singh, [email protected]

Abstract Financial fraud is a threat that is increasing at a great pace and has a very bad impact on the economy, collaborative institutions and administration. Credit card transactions are increasing faster because of advances in internet technology, which lead to high dependence on the internet. With the upgrading of technology and the increase in credit card usage, fraud rates have become a challenge for the economy. As new security features are added to credit card transactions, fraudsters also develop new patterns or loopholes to target the transactions. As a result, the behaviour of fraudulent and normal transactions changes constantly. A further problem with credit card data is that it is highly skewed, which leads to inefficient prediction of fraudulent transactions. To achieve better results, imbalanced or skewed data is pre-processed with a re-sampling (over-sampling or under-sampling) technique. Three different proportions of the dataset were used in this study, and the random under-sampling technique was used for the skewed dataset. This work uses three machine learning algorithms, namely logistic regression, Naïve Bayes and K-nearest neighbour. The performance of these algorithms is recorded together with a comparative analysis. The work is implemented in Python, and the performance of the algorithms is measured by accuracy, sensitivity, specificity, precision, F-measure and area under curve. On the basis of these measurements, the logistic regression based model for predicting fraudulent transactions was found to be better than the prediction models developed from Naïve Bayes and K-nearest neighbour. Better results are also seen when under-sampling techniques are applied to the data before developing the prediction model.

Keywords Credit card fraud · Fraud detection · Random under-sampling · Logistic regression · Naïve Bayes · KNN

1 Introduction

Fraud is typically defined as criminal deception with the purpose of obtaining gain. With the fast-growing dependence on internet technology, the rate at which credit card frauds happen has also increased alarmingly. Almost all modes of transaction, online or offline, are made via credit cards. There are two types of credit card fraud: inner card fraud and external card fraud. Inner card fraud takes place when a false identity is used to commit fraud through collusion between cardholders and the bank, whereas external fraud consists of using a stolen credit card to obtain cash through dubious means [1]. Most research work focuses on the detection of external credit card fraud. Credit card fraud is a serious issue that imposes a huge cost on banking organisations and card-issuing companies. Because of this major weakness in the transaction system, banking organisations treat credit card fraud as a grave issue; to curb the menace they maintain fully-fledged security systems that monitor transactions and detect frauds as quickly as possible once they are committed. Fraud detection is necessary to limit the impact of dubious transactions on service delivery, costs and company reputation. Machine learning has been helpful
in solving a variety of important business problems such as detecting email spam, targeted product recommendation and accurate diagnosis. The rise of machine learning has been attributed to increasing processing power, the availability of large amounts of data and improvements in statistical modelling [2]. The most difficult task for banks and the commerce industry is fraud management. The number of transactions has multiplied because of the large number of payment channels: credit/debit cards, smartphones and kiosks. At the same time, criminals have become adept at finding loopholes. Hence it is very difficult for businesses to authenticate transactions [3].

Data researchers have been quite successful in addressing this problem with machine learning and predictive analytics. The difficulty with credit card fraud detection is that the data is skewed or unbalanced: algorithms treat the minority class as noise and predict only the majority class accurately [4, 5]. For skewed data there are various resampling techniques that can be applied to produce better results. The imbalance problem can also be addressed by an ensemble learning framework, which ensures the integrity of the sample features and is based on splitting and aggregating the training set [6]. To overcome the problem of false alarm rates and to increase the efficiency of credit card fraud detection, various approaches such as outlier detection methods have been used [7]. These approaches can optimise the solution, and their implementation in a bank credit card fraud detection system (CCFDS) is very useful for detecting and preventing fraudulent transactions [7].

2 Related work

Credit card fraud is on the rise as the number of online transactions increases. To prevent fraudulent transactions and to detect credit card fraud, effective methods are needed that can detect and prevent fraudulent transactions before they cause huge losses to banks and credit card holders. Researchers have proposed various fraud detection methods that are effective to some extent, but the real problems are the limited availability of datasets due to security issues and the strong imbalance of the datasets. Some of the methods proposed for credit card fraud detection are Neural Networks, Fusion of Dempster-Shafer, Bayesian Learning, Hidden Markov Model, Fuzzy Darwinian System, outlier detection methods, Support Vector Machines, Genetic Algorithm, Covering Algorithm, Meta-Classifiers, Data Mining, Ensemble Learning and machine learning algorithms (Random Forest, Decision Trees, Support Vector Machines, Bayesian Networks, MLP, Naïve Bayes and many more). A review of some of the research papers related to credit card fraud detection and its prevention follows.

Padvekar et al. [8] demonstrated that credit card fraud can be detected using a hidden Markov model (HMM) during transactions. The HMM achieves high fraud coverage combined with a low false alarm rate. They used ranges of transaction amount as the observation symbols, while the types of items are considered to be the states of the HMM. They proposed a method for finding the spending profile of cardholders, as well as the use of this knowledge in deciding the values of the observation symbols and the estimate of the model parameters. It is also explained how the HMM can detect whether an incoming transaction is fraudulent. Comparative studies reveal that the accuracy of the system is close to 80% over a wide variation in the data. The system is also scalable for handling large volumes of transactions.

Khare and Sait [9] examined the performance of Decision Tree, Random Forest, SVM and Logistic Regression classifiers. The methods were applied to both the raw and the pre-processed data. They concluded that logistic regression has an accuracy of 97.7%, SVM 97.5% and Decision Tree 95.5%, while the best result is obtained by Random Forest with an accuracy of 98.6%. The results therefore indicate that Random Forest gives the highest accuracy, 98.6%, for credit card fraud detection on the dataset provided by ULB.

Banerjee et al. [10] examined various machine learning classifiers trained on a public dataset to analyse the correlation of certain factors with fraudulence. Better metrics are used to determine false negative rates, and the performance of random sampling was measured to deal with the class imbalance of the dataset. The support vector machine performed better for detecting credit card fraud under realistic conditions. Deep learning and regression models are compared to determine which algorithm and combination of factors provides the most accurate method of classifying a credit card transaction as fraudulent or non-fraudulent. The best algorithm for datasets with a close to 1:1 ratio of fraudulent to non-fraudulent transactions is the Random Forest classifier, assuming the fraud-to-non-fraud distribution of the testing and training sets is the same.

Mishra and Ghorpade [4] analysed various classification techniques using a range of metrics for evaluating the
classifiers. The models were trained with various classification and ensembling techniques. The models used were logistic regression, Decision Tree, Random Forest, Support Vector Machines and various ensemble models. Results obtained from the actual dataset were also good, and with a recall of about 96% the Random Forest classifier performed better than the other classifiers.

Xuan et al. [11] used two kinds of random forest algorithms to learn the features of normal and abnormal transactions. The two random forests differ in their base classifiers, and their performance on credit card fraud detection is compared. The two algorithms are the random-tree-based random forest and the CART-based random forest, whose training sets come from bootstrapped samples. Three experiments were performed for the two algorithms on datasets with different proportions. The performance of the algorithms was measured for all three experiments, and the additional metrics were the intervention rate of transactions and the average rate of the model. The CART-based random forest performed better in all the experiments.

3 Methodology

In this research work the machine learning classifiers logistic regression (LR), K-nearest neighbour (KNN) and Naïve Bayes (NB) are applied, with Python as the implementation language. The experiments are evaluated using the confusion matrix, and the performance of the algorithms is compared using accuracy, sensitivity, specificity, precision, F-measure and area under curve (AUC).

Creating and processing the classifiers involves several stages: gathering of data, pre-processing of data, training of algorithms, testing of algorithms and analysis of classifiers.

During pre-processing, the data is transformed into a usable format and then sampled using sampling techniques. Random under-sampling is applied because the dataset is highly imbalanced and biased towards the negative (non-fraud) cases; under-sampling yields three data distributions. The features selected are the principal components produced by principal component analysis (PCA) dimensionality reduction, which results in 28 principal components denoted V1, V2, …, V28. During the training stage the algorithms are given the processed data as input. The testing datasets are then assessed using the trained models of the classifiers. The step-by-step methodology is explained below. Figure 1 shows the flow diagram of the research work.

Fig. 1 Flow diagram of research work

3.1 Collection of dataset and pre-processing

The dataset is obtained from Kaggle, which hosts the credit card fraud detection dataset [12]. It consists of credit card transactions made by European cardholders in September 2013. The transactions occurred over 2 days and amount to 284,807 entries. The positive class (fraud cases) makes up 0.172% of the transactions. The features are transformed by PCA into numerical input values and reduced to 28 principal components, named V1, V2, V3, … and V28. The features include credit limit, gender, marital status, previous months' bills, previous months' payments, status of existing account, salary assignments, credit history, other existing credits, purpose, credit amount, present employment, savings account, personal status, other debtors, property, age in months, housing, number of existing credits, job, telephone, foreign worker, ID, credit card number, PIN, Time, Amount and Class. From the total number of entries and the number of fraud cases it can be inferred that the dataset is highly unbalanced and skewed towards the
negative class. The background details of the features are hidden and cannot be shown due to privacy issues. The 'Time' feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The 'Amount' feature is the transaction amount. The 'Class' feature indicates whether the transaction is fraud or non-fraud: a value of 1 represents a fraud transaction and a value of 0 represents a non-fraud transaction.

3.2 Under-sampling of dataset

To cope with unbalanced datasets, one can either modify the classification algorithms or balance the classes in the training data; the latter is known as data pre-processing. The pre-processed data is then provided as input to the machine learning algorithm, and this approach is widely used. The chief aim of data pre-processing is to increase the number of minority-class instances or to decrease the number of majority-class instances, so that both classes have the same number of instances.

Random under-sampling (RUS) is a widely used resampling method and is therefore used in our study. RUS was chosen for its simplicity and effectiveness. The aim of RUS is to balance the class distribution by randomly discarding majority-class examples. The procedure is repeated until the majority and minority instances are balanced. RUS improves run time and storage requirements because it reduces the number of training samples in large datasets. The only limitation of RUS is the loss of some important information. In this study the dataset is distributed in three proportions, expressed as fraud:non-fraud ratios: 50:50, 34:66 and 25:75. Results are obtained with random under-sampling for all of these data distributions (Fig. 2).

Fig. 2 Random under-sampling working of the dataset

3.3 Classification techniques

The credit card dataset has two class values, which makes this a binary classification problem in which transactions are classified either as fraud (1) or non-fraud (0). After the data is resampled by under-sampling, the classifiers are trained on the training data in order to evaluate the methods. The classification techniques used in this study are logistic regression (LR), Naïve Bayes (NB) and K-nearest neighbour (KNN).

3.4 Dataset division

The credit card dataset is split into two parts, a training set and a testing set. In this study we chose the ratios 50:50, 34:66 and 25:75 (fraud:non-fraud). Figure 3 gives an overview of the dataset division: the data is split into training and testing sets and resampling (random under-sampling) is applied. Tables 1, 2 and 3 give more details about the dataset division.

In Table 1 the dataset is divided in the ratio 50:50, meaning that the same number of fraud and non-fraud instances is taken to train the three classification techniques.

In Table 2 the dataset (fraud and non-fraud instances) is divided in the ratio 34:66; these numbers of instances are taken to train the three classification techniques.

In Table 3 the dataset (fraud and non-fraud instances) is divided in the ratio 25:75; these numbers of instances are taken to train the three classification techniques.

Fig. 3 Division of dataset
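
The random under-sampling step of Sect. 3.2 can be illustrated with a minimal sketch. It is not the authors' code: the function name `random_undersample`, the toy label list and the seed are assumptions made for illustration, and the sketch keeps every fraud row it is given while drawing non-fraud rows at random until the requested fraud share is reached.

```python
import random
from collections import Counter


def random_undersample(labels, fraud_share, seed=0):
    """Return sorted row indices: every fraud row (label 1) plus a random
    subset of non-fraud rows (label 0) sized so that fraud makes up roughly
    `fraud_share` of the result. Hypothetical helper, not from the paper."""
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    # Number of non-fraud rows needed for the requested fraud share.
    n_legit = round(len(fraud_idx) * (1 - fraud_share) / fraud_share)
    n_legit = min(n_legit, len(legit_idx))
    # Discard majority-class rows at random by keeping only a sample of them.
    kept_legit = rng.sample(legit_idx, n_legit)
    return sorted(fraud_idx + kept_legit)


# Toy labels: 50 fraud rows among 10,000 (not the paper's data).
labels = [1] * 50 + [0] * 9950
random.Random(1).shuffle(labels)

# The three fraud:non-fraud proportions used in the study.
for share in (0.50, 0.34, 0.25):
    idx = random_undersample(labels, share)
    counts = Counter(labels[i] for i in idx)
    print(f"fraud share {share:.2f}: fraud={counts[1]}, non-fraud={counts[0]}")
```

The returned indices can be used to select the corresponding feature rows. Fixing the seed makes the discarded rows reproducible, which matters because RUS results depend on which majority-class rows happen to be dropped.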

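Training the three classifiers of Sect. 3.3 on a 70/30 split could look like the sketch below. The paper states only that Python was used; scikit-learn, the Gaussian variant of Naïve Bayes, `n_neighbors=3` and the synthetic data from `make_classification` are all assumptions made for illustration, so the printed counts say nothing about the paper's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for an under-sampled dataset (about 25% positives,
# 28 numeric features to mirror V1..V28). Not the Kaggle data.
X, y = make_classification(n_samples=1200, n_features=28, n_informative=10,
                           weights=[0.75, 0.25], random_state=0)

# Hold out 30% for testing, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

# Assumed model choices; the paper does not give hyperparameters.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # For labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    results[name] = {"TP": int(tp), "FP": int(fp), "TN": int(tn), "FN": int(fn)}
    print(name, results[name])
```

The four counts per model are exactly the entries of the confusion matrix from which the performance measures are derived.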

Table 1 Division of dataset by ratio 50:50

Data division   Training data   Resampling method (RUS)
Fraud           492             344
Non-fraud       284,315         344
Total           284,807         688

Table 2 Division of dataset by ratio 34:66

Data division   Training data   Resampling method (RUS)
Fraud           492             341
Non-fraud       284,315         692
Total           284,807         1033

Table 3 Division of dataset by ratio 25:75

Data division   Training data   Resampling method (RUS)
Fraud           492             353
Non-fraud       284,315         1024
Total           284,807         1377

Since 30% of the dataset is used for testing the models, the number of fraud and non-fraud cases in the testing data after resampling (random under-sampling) is given in Table 4 for the three proportions.

Table 4 Preparation of testing dataset

Data proportion   Fraud   Non-fraud   Total
50:50             35      261         296
34:66             137     306         443
25:75             141     450         591

From here on we use A = 50:50, B = 34:66 and C = 25:75.

After the training and testing datasets are selected, the three classification techniques LR, NB and KNN are trained on the training dataset, giving three corresponding models. The testing dataset is then classified with these three models and the performance is evaluated.

3.5 Performance evaluation

Performance was evaluated for the three classification techniques LR, NB and KNN with the resampling technique (RUS) used. The four basic quantities on which the evaluation is based are true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). True positives are cases that are predicted as positive and are in reality positive. True negatives are cases correctly predicted as negative. False positives are cases predicted as positive that are actually negative. False negatives are cases predicted as negative that are actually positive. The relationship between these quantities is given by the confusion matrix. In addition, the performance of the three algorithms is compared in terms of sensitivity, specificity, accuracy, F-measure and area under curve (AUC). These metrics are calculated from the confusion matrix shown in Table 5.

Table 5 Confusion matrix of credit card dataset

                   Predicted fraud   Predicted non-fraud
Actual fraud       TP                FN
Actual non-fraud   FP                TN

Accuracy Accuracy is defined as the ratio of correctly predicted transactions to the total number of predicted transactions [13]

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (1)$$

Sensitivity The proportion of observed positive values that are correctly predicted as positive. It is also called the True Positive Rate (TPR) [13]

$$\text{Sensitivity (Recall)} = \frac{TP}{TP + FN} \qquad (2)$$

Specificity Specificity is the proportion of negative (legitimate) cases that are correctly classified as negative; in our case it gives the accuracy of predicting legitimate transactions. It is also called the True Negative Rate (TNR) [13]

$$\text{Specificity} = \frac{TN}{FP + TN} \qquad (3)$$

Precision The proportion of positive (fraud) predictions that are actually correct [13].

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

F-measure The F-measure gives the accuracy of the test, that is, of the experiments performed. It uses both precision and recall to compute
its value. The best value of the F1 score is 1 [14].

$$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)$$

Area under curve (AUC) AUC represents the degree or measure of separability, that is, how capable the model is of distinguishing between the classes [14].

$$\text{AUC} = \frac{1}{2} \cdot (\text{Sensitivity} + \text{Specificity}) \qquad (6)$$

4 Result and discussion

This part deals with the results gathered during the experiments. Tables 6, 7 and 8 show the comparison results of the three classifiers with the resampling technique for the ratios 50:50, 34:66 and 25:75 respectively. The parameters chosen for comparison are sensitivity, specificity, accuracy, precision, F-measure and AUC. From the results it is clear that in all three proportions logistic regression (LR) dominates with higher accuracy for the random under-sampling method used.

(i) Comparison of classification techniques by ratio A

Table 6 Comparison of classification techniques by ratio A (resampling method: random under-sampling)

Techniques            Sensitivity   Specificity   Accuracy   Precision   F-measure   AUC
Logistic regression   0.878         0.949         0.912      0.951       0.913       0.914
Naïve Bayes           0.757         0.964         0.854      0.959       0.846       0.860
K-nearest neighbour   0.687         0.669         0.679      0.701       0.694       0.678

(ii) Comparison of classification techniques by ratio B

The values of the parameters for the three classifiers for ratio B are given in Table 7.

Table 7 Comparison of classification techniques by ratio B (resampling method: random under-sampling)

Techniques            Sensitivity   Specificity   Accuracy   Precision   F-measure   AUC
Logistic regression   0.777         1.0           0.923      1.0         0.875       0.888
Naïve Bayes           0.718         1.0           0.902      1.0         0.836       0.859
K-nearest neighbour   0.477         0.789         0.681      0.544       0.508       0.633

(iii) Comparison of classification techniques by ratio C

Table 8 Comparison of classification techniques by ratio C (resampling method: random under-sampling)

Techniques            Sensitivity   Specificity   Accuracy   Precision   F-measure   AUC
Logistic regression   0.839         0.997         0.959      0.991       0.909       0.918
Naïve Bayes           0.664         0.995         0.915      0.979       0.789       0.829
K-nearest neighbour   0.405         0.861         0.751      0.483       0.441       0.633

The sensitivity comparison for each proportion is shown in Fig. 4, which shows a better result for logistic regression than for Naïve Bayes and KNN.

Fig. 4 Sensitivity (bar chart of logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

The specificity comparison for each proportion is shown in Fig. 5, which shows similar results for logistic regression and Naïve Bayes for each of the under-sampling ratios A, B and C.

Fig. 5 Specificity (bar chart of logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

As shown in Figs. 6, 7 and 8, logistic regression gives the better result in all three proportions. It gives the highest accuracy in all three proportions compared with Naïve Bayes and KNN (Fig. 6), whereas KNN shows the lowest accuracy in all three proportions compared with the other two algorithms in Fig. 6. In the 25:75 ratio, represented by 'C', all the algorithms performed better in terms of accuracy. This split is therefore considered the best for further training and testing. The same pattern, with better performance of logistic regression, can be seen for the other performance measures (precision and F-measure) in Figs. 7 and 8 respectively.

Fig. 6 Accuracy (bar chart of logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

Fig. 7 Precision (bar chart of logistic regression, Naïve Bayes and K-nearest neighbour for ratios A, B and C)

4.1 Analysis of algorithms for the three proportions

The values of the parameters for the three proportions are given in Tables 6, 7 and 8. Logistic regression shows higher values for nearly all the parameters because it maximizes the conditional data likelihood function. The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their respective X values. The second reason is that logistic regression treats the feature values as dependent, and the correlation between these features contributes to the prediction of a new data point. Logistic regression also shows the higher accuracy (91.2%, 92.3% and 95.9% for ratios A, B and C respectively) for a medium-sized dataset, and it is able to estimate the patterns of the fraud data in the balanced dataset. In this study the data was balanced by under-sampling. These might be the reasons for the better performance of logistic regression on each parameter (sensitivity, specificity, precision, F-measure and AUC). The decision boundary is set by the maximum conditional data likelihood function.

As seen in Tables 6, 7 and 8, the Naïve Bayes algorithm shows lower accuracy than logistic regression. This might be because Naïve Bayes assumes that the features are independent of each other, so each feature contributes individually to the prediction of a new data point. Secondly, the assumption that the features are not correlated dramatically decreases the number of parameters that must be estimated to learn the classifier, which might be why the algorithm sometimes shows lower values of sensitivity, specificity, accuracy, precision, F-measure and AUC than logistic regression. For example, the sensitivity of Naïve Bayes is 0.757, 0.718 and 0.664, compared with 0.878, 0.777 and 0.839 for logistic regression,
F-Measure respectively. Table 6 clearly shows better performance for

1 all the metrics (else than Accuracy and Precision) as
0.9 compared to other proportions shown in Tables 7 and 8.
0.8 The KNN showed the accuracy of 75% for 50:50 propor-
0.7 Logistic Regression tion which is better than the other proportions as it dif-
0.6 ferentiates the classes more accurately when the training
0.5 Naïve Bayes data increases. This is clear that when the training data is
0.4 increased the algorithms are performing better and this
0.3 K-Nearest shows that the data proportion of 25:75 is better for training
Neighbour
0.2 the classifiers. For all the proportions taken LR performs
0.1 very well with at most 95% correctness. The accuracy of
0 all the classifiers for all the three data proportions is shown
A B C
in the Fig. 6.
Fig. 8 F-measure

for respectively three different ratios of 50:50, 34:66 and 5 Conclusion and future work
25:75. Also Naı̈ve Bayes algorithm has a greater bias but
lower variance than logistic regression which might be The research work was carried out with the purpose of
support the under sampling methodology of data balancing. comparing the ability of machine learning algorithms as to
K-nearest neighbour requires a distance or measure the how accurately they differentiate and classify the fraud and
separation characterized between two information. In pro- non-fraud transactions of the credit card dataset with ran-
cedure of KNN, it characterizes any approaching transac- dom under sampling method (RUS) and to check out if the
tion by ascertaining separation of closest point to new performance is improved or not. Logistic Regression (LR)
approaching transaction. At that point if the closest showed the optimal performance for all the data propor-
neighbour is fraudulent, then the transaction is classified as fraud. The value of K is usually chosen to be small and odd so as to break ties (typically 1, 3 or 5). Larger K values can reduce the effect of noise in the dataset. In this algorithm, the distance between two data instances is calculated using the Euclidean distance. For multivariate data, the distance is typically computed for each attribute and then combined. As the bar graphs in Figs. 4, 5, 6, 7 and 8 show, the algorithm gives poor accuracy for proportion C (50:50). This is attributed to the small training sample, in which the fraud and non-fraud cases are very similar, so the algorithm does not efficiently differentiate the patterns of the two classes.

The results of the experiments make it clear that the accuracy of the classifiers increases as the amount of training data increases. Table 6 summarises the results for the ratio 34:66 under Random Under-sampling (RUS). In this proportion, logistic regression and Naïve Bayes have the same specificity of 1.0, which means that both classifiers classified all negative (non-fraud) cases correctly. This may be because the accuracy of both classifiers increases as the training sample grows: both algorithms estimate prior probabilities, and these estimates become more reliable as the number of training samples increases, which helps in classifying the data samples more accurately. This is reflected in the AUC values, which are 0.89 and 0.85 for logistic regression and Naïve Bayes

tions as compared to Naïve Bayes (NB) and K-Nearest Neighbour (KNN). LR achieved higher accuracy than Naïve Bayes and KNN: LR showed a maximum accuracy of 95%, NB 91% and KNN 75%. LR also showed better sensitivity, specificity, precision and F-measure than NB and KNN. It was also observed that the model-based techniques (LR and Naïve Bayes) gave better results in every case than the instance-based K-NN technique.

Other resampling methods could also be applied to the skewed dataset for credit card fraud detection (CCFD), and the resampling methods could be improved to get better results. Our results could also be compared with other techniques such as Random Forest, SVC, Decision Trees, Neural Networks and Genetic Algorithms. The main limitation of Random Under-sampling is that some information may be lost; new resampling methods could be devised to achieve optimal results, which may prove helpful for CCFD in future. Our results may also help organisations build a much improved credit card fraud detection system (CCFDS) that deals better with skewed data and uses better metrics to evaluate the outcomes.
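As an illustration of the K-nearest-neighbour rule and the specificity measure discussed in this section, the following is a minimal Python sketch. It is not the implementation used in the experiments: the function names (`euclidean`, `knn_predict`, `sensitivity_specificity`) and the two-attribute toy data are assumptions made only for this example, with 1 denoting fraud and 0 denoting non-fraud.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance: square the per-attribute differences, sum, take the root."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training instances.

    With two classes, an odd k guarantees that the vote cannot tie.
    """
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical toy data, not taken from the paper's dataset.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.05, 0.1), (1.0, 0.95), (0.15, 0.15), (0.95, 1.0)]
test_y = [0, 1, 0, 1]

preds = [knn_predict(train_X, train_y, q, k=3) for q in test_X]
sens, spec = sensitivity_specificity(test_y, preds)
print(preds, sens, spec)
```

A specificity of 1.0, as reported for logistic regression and Naïve Bayes at the 34:66 ratio, corresponds to zero false positives in `sensitivity_specificity`: every non-fraud case is labelled as non-fraud.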


References Proceedings of the 3rd IEEE international conference on advan-

ces in electrical and electronics, information, communication and
1. Kundu A, Panigrahi S, Sural S, Majumdar AK (2009) BLAST- bio-informatics, AEEICB 2017, pp 255–258
SSAHA hybridization for credit card fraud detection. IEEE Trans 8. Padvekar SA, Kangane PM, Jadhav KV (2016) Credit card fraud
detection system. Int J Eng Comput Sci 5(4):16183–16186
Dependable Secure Comput 6(4):309–315
2. Guo T, Li G-Y (2008) Neural data mining for credit card fraud 9. Khare N, Sait SY (2018) Credit card fraud detection using
detection. In: International conference on machine learning and machine learning models and collating machine learning models.
cybernetics Int J Pure Appl Math 118(20):825–838
10. Banerjee R, Bourla G, Chen S, Kashyap M, Purohit S, Battipaglia
3. Ghobadi F, Rohani M (2016) Cost sensitive modeling of credit
card fraud using neural network strategy. In: International con- J (2018) Comparative analysis of machine learning algorithms
ference of signal processing and intelligent systems (ICSPIS) through credit card fraud detection. New Jersey’s Governor’s
4. Mishra A, Ghorpade C (2018) Credit card fraud detection on the School of Engineering and Technology, Piscataway, pp 1–10
skewed data using various classification and ensemble tech- 11. Xuan S, Liu G, Li Z, Zheng L, Wang S, Jiang C (2018) Random
niques. In: 2018 IEEE International students’ conference on forest for credit card fraud detection. In: ICNSC 2018—15th
electrical, electronics and computer science, SCEECS 2018 IEEE International conference on networking, sensing and con-
5. Raj SE, Portia AA (2011) Analysis on credit card fraud detection trol, pp 1–6
methods. In: International conference on computer, communica- 12. Hordri NF, Yuhaniz SS, Firdaus N, Azmi M, Shamsuddin SM
tion and electrical technology (2018) Handling class imbalance in credit card fraud using
6. Wang H, Zhu P, Zou X, Qin S (2018) An ensemble learning resampling methods. Int J Adv Comput Sci Appl 9(11):390–396
13. Awoyemi JO, Adetunmbi AO, Oluwadare SA (2017) Credit card
framework for credit card fraud detection based on training set
partitioning and clustering. In: 2018 IEEE SmartWorld, Ubiquitous fraud detection using machine learning techniques: a comparative
Intelligence & Computing, Advanced & Trusted Computing, Scal- analysis. In: Proceedings of the IEEE international conference on
able Computing & Communications, Cloud & Big Data Computing, computing, networking and informatics, ICCNI 2017, vol 2017–
Internet of People and Smart City Innovation (SmartWorld/SCAL- Jan, pp 1–9
COM/UIC/ATC/CBDCom/IOP/SCI), Guangzhou, pp 94–98. 14. Hordri NF, Yuhaniz SS, Azmi NFM, Shamsuddin SM (2018)
https://doi.org/10.1109/SmartWorld.2018.00051 Handling class imbalance in credit card fraud using resampling
7. Malini N, Pushpa M (2017) Analysis on credit card fraud iden- methods. Int J Adv Comput Sci Appl 9(11):390–396
tification techniques based on KNN and outlier detection. In:

