Menakshi Satwinder
Comparison and analysis of logistic regression, Naïve Bayes and KNN machine
learning algorithms for credit card fraud detection
All content following this page was uploaded by Satwinder Singh on 12 February 2022.
ORIGINAL RESEARCH
Satwinder Singh
[email protected]
1 Central University of Punjab, Bathinda, India
2 A.P. Department of Computer Science and Technology, Central University of Punjab, Bathinda, India

Abstract Financial fraud is a threat that is growing at an increasing pace and has a severe impact on the economy, collaborative institutions and administration. Credit card transactions are increasing rapidly because of advances in internet technology, which lead to a high dependence on the internet. With the upgrading of technology and the increased use of credit cards, fraud rates have become a challenge for the economy. As new security features are added to credit card transactions, fraudsters also develop new patterns or find loopholes to exploit the transactions. As a result, the behaviour of fraudulent and normal transactions changes constantly. A further problem is that credit card data is highly skewed, which leads to inefficient prediction of fraudulent transactions. To achieve better results, imbalanced or skewed data is pre-processed with a re-sampling (over-sampling or under-sampling) technique. Three different proportions of the dataset were used in this study, and the random under-sampling technique was applied to the skewed dataset. This work uses three machine learning algorithms, namely logistic regression, Naïve Bayes and K-nearest neighbour. The performance of these algorithms is recorded together with their comparative analysis. The work is implemented in Python, and the performance of the algorithms is measured in terms of accuracy, sensitivity, specificity, precision, F-measure and area under curve. On the basis of these measurements, the logistic regression based model for predicting fraudulent transactions was found to be better than the prediction models developed from Naïve Bayes and K-nearest neighbour. Better results are also seen when under-sampling techniques are applied to the data before developing the prediction model.

Keywords Credit card fraud · Fraud detection · Random under-sampling · Logistic regression · Naïve Bayes · KNN

1 Introduction

Fraud is typically defined as criminal deception with the purpose of obtaining gain. With the fast-growing dependence on internet technology, credit card fraud has also increased at an alarming rate. Almost all modes of transaction, whether online or offline, are made via credit cards. External credit card fraud detection is the focus of the majority of research work. There are two types of credit card fraud: inner card fraud and external card fraud. Inner card fraud takes place when a false identity is used to commit fraud through collusion between cardholders and the bank, whereas external fraud consists of taking the credit card to obtain cash through dubious means [1]. Credit card fraud is a vital issue and costs banking organisations and card issuing companies a huge amount. Because of this serious weakness in the transaction system, banking organisations treat credit card fraud as a grave issue; to curb the menace they maintain fully-fledged security systems that keep a check on transactions and spot frauds as quickly as possible once they are committed. Fraud detection is necessary to limit the impact of dubious transactions on service delivery, costs and company reputation. Machine learning has been helpful
Int. j. inf. tecnol.
in solving a variety of important business problems such as detecting email spam, targeted product recommendation and accurate diagnosis. The rise of machine learning has been attributed to increasing processing power, the availability of large amounts of data and improvements in statistical modelling [2]. The most difficult task for banks and the commerce industry is fraud management. The number of transactions has multiplied because of the large number of payment channels: credit/debit cards, smartphones and kiosks. At the same time, criminals have become adept at finding loopholes. Hence it is very difficult for businesses to authenticate transactions [3].

Data researchers have been quite successful in addressing this problem with machine learning and predictive analytics. The problem that comes with credit card fraud detection is skewed or unbalanced data: the algorithms treat the minority category as noise and predict only the majority category accurately [4, 5]. For skewed data, various resampling techniques can be applied to produce better results. The imbalance problem can also be addressed by an ensemble learning framework that preserves the integrity of the sample features and relies on splitting and aggregating the training set [6]. To overcome the problem of false alarm rates and to increase the efficiency of credit card fraud detection, various approaches such as outlier detection methods have been used [7]. These approaches can optimise the best solution, and their implementation in a bank credit card fraud detection system (CCFDS) is very useful in detecting and preventing fraudulent transactions [7].

2 Related work

Credit card fraud is on the rise as the number of online transactions increases. To prevent fraudulent transactions and detect credit card fraud, effective methods are needed that can detect and prevent fraudulent transactions before they cause huge losses to banks and credit card holders. Researchers have proposed various fraud detection methods that are effective to some extent, but the real problems are the limited availability of datasets, owing to security issues, and the strong imbalance of the datasets. Some of the methods proposed for credit card fraud detection are Neural Networks, Fusion of Dempster Shafer, Bayesian Learning, Hidden Markov Model, Fuzzy Darwinian System, Outlier detection methods, Support Vector Machines, Genetic Algorithm, Covering Algorithm, Meta-Classifiers, Data Mining, Ensemble Learning and Machine Learning Algorithms (Random Forest, Decision Trees, Support Vector Machines, Bayesian Networks, MLP, Naïve Bayes and many more). A review of some of the research papers related to credit card fraud detection and its prevention follows.

Padvekar et al. [8] demonstrated that credit card fraud can be detected using a hidden Markov model (HMM) during transactions. The hidden Markov model achieves high fraud coverage combined with a low false alarm rate. They used ranges of transaction amount as the observation symbols, while the categories of items are considered to be the states of the HMM. They proposed a method for finding the spending profile of cardholders, as well as the use of this information in deciding the values of the observation symbols and estimating the model parameters. It is also explained how the HMM can detect whether an incoming transaction is fraudulent or not. Comparative studies reveal that the accuracy of the system is close to 80% over a wide variation in the input data. The system is also scalable for handling large volumes of transactions.

Khare and Sait [9] examined and compared the performance of Decision Tree, Random Forest, SVM and Logistic Regression classifiers. The methods were applied to both the raw and the pre-processed data. The experiments showed that logistic regression has an accuracy of 97.7%, SVM 97.5% and decision tree 95.5%, while the best results are obtained by random forest with an accuracy of 98.6%. The authors therefore conclude that random forest gives the highest accuracy, 98.6%, for credit card fraud detection on the dataset provided by ULB.

Banerjee et al. [10] examined various machine learning classifiers trained on a public dataset to analyse the correlation of certain factors with fraudulence. Better metrics are used to determine false negative rates, and the performance of random sampling was measured as a way of dealing with the class imbalance of the dataset. The support vector machine performed better for detecting credit card fraud under realistic conditions. Deep learning and regression models are compared to determine which algorithm and combination of factors provides the most accurate method of classifying a credit card transaction as fraudulent or non-fraudulent. The best algorithm for datasets with a close to 1:1 ratio of fraudulent to non-fraudulent transactions is the Random Forest classifier, assuming the fraud to non-fraud distribution of the testing and training sets is the same.

Mishra and Ghorpade [4] analysed various classification techniques using various metrics for evaluating various
classifiers. The models were trained using various classification and ensemble techniques. The models used were Logistic Regression, Decision Tree, Random Forest, Support Vector Machines and various ensemble models. These models were trained and the results were obtained. The results obtained on the actual dataset were also good, and with a recall of about 96% the Random Forest classifier performed better than the other classifiers.

Xuan et al. [11] used two kinds of random forest algorithm to learn the features of normal and abnormal transactions. The two random forests, which differ in their base classifiers, are compared and their performance on credit card fraud detection is examined. The two algorithms are random-tree-based random forest and CART-based random forest, whose training sets come from bootstrapped samples. Three experiments were performed for the two algorithms on datasets with different proportions. The performance of the algorithms was measured in all three experiments, and the additional metrics used were the intervention rate of transactions and the average rate of the model. The CART-based random forest performed better in all the experiments.

3 Methodology

In this research work the machine learning classifiers logistic regression (LR), K-nearest neighbour (KNN) and Naïve Bayes (NB) are applied, with Python as the implementation language. The experiments are evaluated using the confusion matrix, and the performance of the algorithms is compared using the following measures: accuracy, sensitivity, specificity, precision, F-measure and area under curve (AUC).

Creating and evaluating the classifiers involves several stages: gathering the data, pre-processing the data, training the algorithms, testing the algorithms and analysing the classifiers.

During pre-processing, the data is transformed into a usable format and is then sampled using sampling techniques. Random under-sampling is carried out because the dataset is highly imbalanced and biased towards the negative cases (non-fraud cases); the under-sampling yields three sets of data distribution. The selected features are the principal components produced by principal component analysis (PCA) dimensionality reduction, which results in 28 principal components represented as V1, V2, …, V28. During the training stage the processed data is given to the algorithms as input. The testing datasets are then assessed using the trained models of the classifiers. The step-by-step methodology is explained below. Figure 1 shows the flow diagram of the research work.

Fig. 1 Flow diagram of research work

3.1 Collection of dataset and pre-processing

The dataset is obtained from Kaggle, which hosts the credit card fraud detection dataset [12]. The dataset consists of credit card transactions made by European cardholders in September 2013. The transactions that occurred over 2 days were recorded, amounting to 284,807 entries. The positive category (fraud cases) makes up 0.172% of the transaction data. The features were transformed by PCA into numerical input values and reduced to 28 principal components, named V1, V2, V3, … and V28. The features include credit limit, gender, marital status, previous months' bills, previous months' payments, status of existing account, salary assignments, credit history, other credits existing, purpose, credit amount, present employment, savings account, personal status, other debtors, property, age in months, Housing, number of existing credits, Job, Telephone, foreign worker, ID, Credit card number, PIN, Time, Amount and Class. From the statistics of total entries and fraud cases it can be inferred that the dataset is very unbalanced and is inclined towards the
negative class. The background details of the features are hidden and cannot be shown due to privacy issues. The ‘Time’ feature contains the seconds elapsed between each transaction and the first transaction in the dataset. The ‘Amount’ feature is the transaction amount. The ‘Class’ feature indicates whether the transaction is fraud or non-fraud: a class value of 1 represents a fraud transaction and a class value of 0 represents a non-fraud transaction.

3.2 Under-sampling of dataset

3.3 Classification techniques

In the credit card dataset there are two values for the classification of transactions, which makes this a binary classification problem where transactions are classified either as fraud (1) or non-fraud (0). After the data is resampled by under-sampling, the classifiers are trained on the training data to evaluate the methods. In this study the classification techniques logistic regression (LR), Naïve Bayes (NB) and K-nearest neighbour (KNN) are used.
its value. The best value of the F1 score is 1 [14].

\[ \text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5} \]

Area under curve (AUC) AUC represents the degree or measure of separability, that is, how well the model is capable of differentiating between the classes [14].

\[ \text{AUC} = \frac{1}{2} \cdot (\text{Sensitivity} + \text{Specificity}) \tag{6} \]

proportions as compared to the other two algorithms in Fig. 6. In the 25:75 ratio, represented by ‘C’, all the algorithms performed better in terms of accuracy. So this split is considered to be the best for further training and testing. The same results, with better performance of logistic regression, can be seen for the other performance measures (precision and F-measure) in Figs. 7 and 8 respectively.

4.1 Analysis of algorithms for the three proportions
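The measures compared in this analysis can all be derived from the four confusion-matrix counts. The following is a minimal sketch, not the authors' code: the function name `evaluate` and the example counts are hypothetical. The F-measure follows Eq. (5) and the AUC follows Eq. (6), which in this single-threshold form coincides with what is commonly called balanced accuracy.

```python
def evaluate(tp, fp, tn, fn):
    """Compute the six measures from confusion-matrix counts.

    tp/fn -- fraud cases predicted as fraud / as non-fraud
    tn/fp -- non-fraud cases predicted as non-fraud / as fraud
    """
    sensitivity = tp / (tp + fn)                 # recall, true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    # Eq. (5): harmonic mean of precision and recall.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Eq. (6): mean of sensitivity and specificity.
    auc = 0.5 * (sensitivity + specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_measure": f_measure,
        "auc": auc,
    }


# Hypothetical counts chosen for illustration; not taken from the paper's tables.
scores = evaluate(tp=90, fp=10, tn=190, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes every denominator is non-zero; a classifier that predicts no positives at all would need an explicit guard for precision and the F-measure.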
Table 6 Comparison of classification techniques by ratio A
Resampling method: random under-sampling
Techniques  Sensitivity  Specificity  Accuracy  Precision  F-measure  AUC

Table 7 Comparison of classification techniques by ratio B
Resampling method: random under-sampling
Techniques  Sensitivity  Specificity  Accuracy  Precision  F-measure  AUC
[Bar charts comparing Logistic Regression, Naïve Bayes and K-Nearest Neighbour across ratios A, B and C; panels: Sensitivity, Accuracy, Specificity, Precision]
Fig. 5 Specificity
Fig. 7 Precision
for the three different ratios of 50:50, 34:66 and 25:75, respectively. Also, the Naïve Bayes algorithm has a greater bias but lower variance than logistic regression, which might support the under-sampling methodology of data balancing. K-nearest neighbour requires a distance measure defined between two data instances. In the KNN procedure, an incoming transaction is classified by computing the distance from the new transaction to its nearest points. If the nearest neighbour is fraudulent, the transaction is flagged as fraud. The value of K is chosen to be small and odd so as to break ties (typically 1, 3 or 5). Larger K values can reduce the effect of a noisy dataset. In this algorithm, the distance between two data instances is computed using the Euclidean distance. For multivariate data, the distance is typically computed for each attribute and then combined. The algorithm shows poor accuracy for the proportion C (50:50), as can be seen from the bar graphs in Figs. 4, 5, 6, 7 and 8. This is due to the small sample of training data: there is much more similarity between the fraud and non-fraud cases, and the algorithm does not efficiently differentiate the patterns in fraud and non-fraud cases.

It is clear from the results of the experiments that the accuracy of the classifiers increases as the training data is increased. Table 6 summarises the information for the ratio 34:66 for random under-sampling (RUS). In this proportion logistic regression and Naïve Bayes have the same specificity, 1.0, which means that both classifiers classified the negative cases (non-fraud) with 100% accuracy. This might be because the accuracy of both classifiers increases as the training data sample grows: both algorithms estimate the prior probability, which increases with the number of samples in the training data and thus helps in classifying the data samples more accurately. This is reflected in the AUC values, which are 0.89 and 0.85 for logistic regression and Naïve Bayes

5 Conclusion and future work

The research work was carried out to compare how accurately machine learning algorithms differentiate and classify the fraud and non-fraud transactions of the credit card dataset under the random under-sampling (RUS) method, and to check whether the performance is improved. Logistic Regression (LR) showed the best performance for all the data proportions as compared to Naïve Bayes (NB) and K-Nearest Neighbour (KNN). LR achieved higher accuracy than Naïve Bayes and KNN: LR showed a maximum accuracy of 95%, NB 91% and KNN 75%. LR also shows better sensitivity, specificity, precision and F-measure than the NB and KNN techniques. It has also been observed that the model-based techniques (LR and Naïve Bayes) show better results in each case than the instance-based KNN technique.

There are other resampling methods that could be applied to the skewed dataset for credit card fraud detection (CCFD). The resampling methods could be improved to get better results. Our results could also be compared with other techniques such as Random Forest, SVC, Decision Trees, Neural Network and Genetic Algorithm. The main limitation of random under-sampling is that some information could be lost; new resampling methods could be devised to achieve optimal results, which could prove helpful in credit card fraud detection (CCFD) in future. Likewise, our results might be useful and can help organisations build a much improved credit card fraud detection system (CCFDS) that deals better with skewed data and uses better metrics to assess the outcomes.