Analysis and Prediction For Credit Card Fraud
Arun, G. K., & Rajesh, P. (2022). Analysis and prediction for credit card fraud detection dataset using
data mining approaches. International Journal of Health Sciences, 6(S5), 4155–4173.
https://doi.org/10.53730/ijhs.v6nS5.9534
G. K. Arun
Department of Computer and Information Science, Annamalai University,
Annamalai Nagar – 608 002, Tamil Nadu, India.
Email: [email protected]
P. Rajesh
PG Department of Computer Science, Government Arts College, Chidambaram –
608 102 (Deputed from Department of Computer and Information Science,
Annamalai University, Annamalai Nagar), Tamil Nadu, India
Corresponding Author: [email protected]
I. Introduction
Data mining is the discovery of information or models from pre-existing data, and the resulting models can take one of several forms. In data analytics research, statisticians were the first to use the term "data mining", referring to attempts to retrieve hidden information that is not stated explicitly by the data. Statistically oriented researchers view data mining as the creation of a statistical model, that is, a fundamental probability distribution from which the visible data are drawn [1]. The credit card industry and its customers have long relied on such statistical models for automation. Recently, these models have been applied to various areas of academic research, especially with respect to share markets and e-commerce. Credit card fraud detection is a growing field, particularly during the Covid period, and presents many challenges and issues for the application of data mining techniques [2].
Global e-commerce sales are predicted to reach almost 7,000 billion US dollars in 2023, as shown in Fig. 1. E-commerce fraud mainly targets credit cards as a means of payment, following widespread credit card adoption. The credit card fraud ratio is the same or perhaps slightly growing worldwide, as indicated in Fig. 2 [3]. The annual value of consumer losses to card-not-present (CNP) fraud for debit and credit cards issued in the United Kingdom (UK) alone amounted to 470.2 million GBP in 2019 [4]. Over the years, technology has changed significantly, and so have fraud patterns. Today, CNP is a dominant type of fraud, as visible in Fig. 2, and it is reported that CNP fraud accounted for 1.43 billion in fraud losses in 2018, an increase of 17.7% compared with 2017 [5], [6].
Credit card fraud detection is one of the most essential research areas within the broader domain of fraud detection, and it depends on computerized records of transactions to detect fraudulent behavior. Every time a credit card is used, transaction data composed of several attributes (e.g., credit card identifier, transaction date, recipient, amount of the transaction) are stored in the databases of the service provider. However, a single transaction record is typically not sufficient to detect a fraud occurrence, and the analysis must consider aggregate measures such as total spent per day, number of transactions per week or average transaction amount [7].
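As a small illustration of such aggregate measures, the following pandas sketch computes total spent per day, transactions per week and the average transaction amount from a hypothetical transaction log; the column names and values are illustrative only and are not taken from the dataset used in this study.

import pandas as pd

# Hypothetical transaction log; column names are illustrative only.
tx = pd.DataFrame({
    "card_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2023-01-01 09:00", "2023-01-01 15:30", "2023-01-02 11:00",
        "2023-01-01 10:00", "2023-01-03 18:45"]),
    "amount": [25.0, 120.5, 15.0, 300.0, 42.0],
})

# Aggregate measures per card: total spent per day, transactions per week,
# and average transaction amount.
daily_total = (tx.groupby(["card_id", tx["timestamp"].dt.date])["amount"]
                 .sum().rename("total_spent_per_day"))
weekly_count = (tx.groupby(["card_id", tx["timestamp"].dt.isocalendar().week])["amount"]
                  .count().rename("transactions_per_week"))
avg_amount = tx.groupby("card_id")["amount"].mean().rename("average_amount")

print(daily_total, weekly_count, avg_amount, sep="\n\n")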
Data mining techniques are used to solve otherwise intractable problems in the field of credit card fraud detection. Credit card fraud detection mainly aims to identify fraudulent transactions by separating transactions into two main classes: valid (genuine) and fraudulent. Fraud detection is built on the analysis of a card's spending behavior. Many techniques have been applied to credit card fraud detection, including artificial neural networks, genetic algorithms, support vector machines, frequent itemset mining, decision trees, the migrating birds optimization algorithm and naïve Bayes [8].
Nowadays most researchers compare their proposed approaches, and the algorithms reviewed in the related literature, against a benchmark algorithm. The comparison uses different standard binary classification measures such as misclassification error, the receiver operating characteristic (ROC), the Kolmogorov–Smirnov (KS) statistic, the F1 score [9] or AUC statistics [10]. These measures may not be the most appropriate for evaluating credit card fraud detection models, since they assume that different misclassification errors carry the same cost, and similarly for correctly classified transactions [11]. Researchers using statistical modeling tools to solve this task can observe the true phenomenon in the real world only through a proxy, given as a restricted set of point-wise observations. In credit card fraud detection, the true phenomenon of interest is the genuine purchase behavior of card holders or, likewise, the malicious behavior of fraudsters [12]. In our experiment, we use random forest as a classifier [13]. The popularity of decision tree models in data mining is owed to their algorithmic simplicity and their flexibility in handling different data attribute types [14]. However, a single-tree model is possibly sensitive to the specific training data and easy to overfit [15], [16].
Seven classification algorithms, namely J48, Random Tree (RT), Decision Stump (DS), Logistic Model Tree (LMT), Hoeffding Tree (HT), Reduced Error Pruning (REP) tree and Random Forest (RF), have been used to measure accuracy. The data mining and machine learning techniques were built with the WEKA (Waikato Environment for Knowledge Analysis) tool, which was used to obtain experimental results on a weather-related dataset. Out of the seven classification algorithms, the Random Tree algorithm outperformed the others by yielding an accuracy of 85.714% [17], and in another paper the authors clearly discuss data mining with decision tree approaches using medical data [18].
In pandemic situations, credit card companies must be able to diagnose fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Fig. 1 and Fig. 2 indicate that credit card fraud increases year by year and show the evolution of the total value of credit card fraud within the Single Euro Payments Area. Finally, it is also vital for businesses not to flag honest transactions as fraudulent; otherwise, companies would keep blocking the credit card, which may lead to client disappointment. So there are two important aspects of this analysis:
• What would happen if the company is not able to detect a fraudulent transaction and does not confirm with the customer whether the recent transaction was made by him/her.
• In contrast, what would happen if the company detects a genuine transaction as fraudulent and keeps calling the customer for confirmation, or might block the card.
The dataset [20] contains 492 frauds out of 284,807 transactions, so it is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. When we try to build a prediction model with this kind of unbalanced dataset, the model will be inclined to classify new unseen transactions as genuine, because the dataset contains about 99% genuine data. As our credit card dataset is highly imbalanced, we should not use the accuracy score as a metric, because it will usually be high and misleading; we should focus instead on the F1 score, precision/recall scores or the confusion matrix.
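To make this evaluation point concrete, the sketch below contrasts accuracy with precision, recall and the F1 score on a synthetic dataset with a similar class imbalance; the features and labels here are randomly generated and only stand in for the real credit card data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced data standing in for the credit card set
# (~0.2% positives); features carry no real signal by construction.
rng = np.random.default_rng(42)
X = rng.normal(size=(20000, 10))
y = (rng.random(20000) < 0.002).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Accuracy is misleadingly high (the model can simply predict "genuine"),
# while precision/recall/F1 and the confusion matrix expose the failure
# on the fraud class.
print("accuracy :", accuracy_score(y_te, pred))
p, r, f1, _ = precision_recall_fscore_support(y_te, pred, average="binary",
                                              zero_division=0)
print("precision:", p, "recall:", r, "f1:", f1)
print(confusion_matrix(y_te, pred))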
Due to confidentiality issues, the original features v1, v2, ..., v28 have been transformed with PCA; however, we may guess that these features might originally have been credit card and transaction details.
A. Linear Regression
In statistical analysis, linear regression is a basic technique that researchers most commonly use for prediction problems. The overall purpose of regression analysis is to explain two things: whether the set of predictor variables does a good job of predicting future values of the dependent variable, and which variables are important predictors of the outcome variable. This regression is used to explain the relationship between one dependent variable and one or more independent variables.
B. Symbolic Representation
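The general form of the multiple linear regression model can be written as follows; this standard formulation is supplied here as equation (1), since the symbolic representation is not reproduced in this extract, with y the dependent variable, x_1, ..., x_k the predictors, \beta_0, ..., \beta_k the coefficients and \varepsilon the error term.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon    (1)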
C. Correlation Coefficient
The linear correlation coefficient measures the strength of the linear relationship between two variables, namely x and y, and its sign indicates the direction of that relationship. The correlation coefficient r ranges from −1 to 1. When the correlation coefficient is near 1 in absolute value, the linear relationship is strong; when it is near 0, the linear relationship is weak. The linear correlation coefficient is presented in equation (2).
r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[\,n \sum x^2 - (\sum x)^2\,][\,n \sum y^2 - (\sum y)^2\,]}}    (2)
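A minimal Python sketch of equation (2), computing the correlation coefficient directly from the sums; the example values are made up for illustration.

import numpy as np

def pearson_r(x, y):
    """Linear correlation coefficient of equation (2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
                  (n * np.sum(y**2) - np.sum(y)**2))
    return num / den

# Toy example: transaction amount vs. one anonymized feature (made-up values).
amount = [10.0, 250.0, 33.5, 99.9, 500.0]
v1 = [-0.2, 1.1, 0.05, 0.4, 2.3]
print(pearson_r(amount, v1))  # matches np.corrcoef(amount, v1)[0, 1]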
D. Decision Tree
Decision trees are easy to conceptualize but naturally suffer from high variance, which makes them weak in terms of precision. One way to address this limitation is to build several variants of a single decision tree by choosing, each time, a separate subset of the same training set, in the context of randomization-based ensemble methods. Random Trees (RT) belong to a class of machine learning algorithms that make predictions by averaging over the predictions of several independent base models. The REP decision tree is a method for generating a tree from a given dataset; it can be considered an extension of the C4.5 decision tree approach that enhances the pruning stage by using Reduced Error Pruning (REP).
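A minimal scikit-learn sketch of such a randomization-based tree ensemble on the fraud dataset is given below; this is not the WEKA configuration used for the experiments, and the file name and column names are assumptions based on the dataset description in Table 2.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load the fraud dataset; file and column names ("Class" as the fraud label)
# are assumptions based on Table 2.
df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns=["Class"]), df["Class"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# 100 trees, mirroring the forest size reported in Table 5.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=1)
rf.fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te), digits=4))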
In statistical analysis, the mean absolute error is used to measure the model error between paired observations expressing the same phenomenon. For statistical samples y and x, examples include comparisons of predicted versus observed values and of a subsequent time versus an initial time. The mean absolute error, an alternative measurement technique, is given in equation (3).
\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n}    (3)
where y_i denotes the prediction, x_i the true value and n the total number of data points (observations).
The root mean square error is the standard deviation of the residuals, also called prediction errors. Residuals measure how far data points lie from the regression line, and the RMSE measures how spread out these residuals are; in other words, it tells you how concentrated the data are around the line of best fit. The RMSE is mainly used for forecasting or predicting the future and is commonly used in climatology, forecasting and regression analysis to verify experimental results.
\mathrm{RMSD} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{N}}    (4)
where i indexes the individual observations, N is the number of non-missing data points, x_i is the actual observation of the time series and \hat{x}_i is the estimated time series value.
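Both error measures are straightforward to compute; the following sketch implements equations (3) and (4) directly, with made-up observed and predicted values.

import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, equation (3)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    """Root mean squared error, equation (4)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up observed vs. predicted transaction amounts.
actual = [88.0, 120.0, 15.5, 300.0]
predicted = [92.0, 110.0, 20.0, 280.0]
print(mae(actual, predicted), rmse(actual, predicted))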
Fig. 1 E-commerce sales worldwide from 2014 to 2023 (billion U.S. dollars)
Fig. 2 Evolution of the total value of credit card fraud within Single Euro
Payments Area
Table 2. Credit card fraud detection dataset with time, confidentiality features and transferred amount

Features    Description
C1          Credit card transaction time in seconds
C2-C29      Confidentiality features, indicated as v1 to v28
C30         Transferred amount
C31         Class indicating whether the transaction is fraud or not (1 or 0)
\delta = \left| \frac{u_A - u_E}{u_E} \right| \times 100\%    (5)

where \delta is the percentage error of the corresponding model, u_A is the actual observed value and u_E is the expected value.
The root relative squared error measures how well a model performs relative to a simple predictor, which simply predicts the average of the actual values. Based on statistical analysis, the root relative squared error E_i of an individual model i is evaluated by the RRSE equation (6).
E_i = \sqrt{\frac{\sum_{j=1}^{n} (P_{(ij)} - T_j)^2}{\sum_{j=1}^{n} (T_j - \bar{T})^2}}    (6)

\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j
For a perfect fit, the numerator is equal to 0 and E_i = 0. The E_i index ranges from 0 to infinity, with 0 corresponding to the ideal for the corresponding model.
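A small sketch of equation (6), showing that a perfect fit yields 0 and that predicting the target mean yields 1; the values are made up for illustration.

import numpy as np

def rrse(targets, predictions):
    """Root relative squared error of equation (6): model error relative
    to a naive predictor that always outputs the target mean."""
    t, p = np.asarray(targets, float), np.asarray(predictions, float)
    return np.sqrt(np.sum((p - t) ** 2) / np.sum((t - t.mean()) ** 2))

t = [1.0, 2.0, 3.0, 4.0]
print(rrse(t, t))                      # perfect fit: 0.0
print(rrse(t, [np.mean(t)] * len(t)))  # mean predictor: 1.0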
I. Data Normalization
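The normalization equation (7) referred to later in the text is not reproduced in this extract; the sketch below assumes the common min–max scaling to the [0, 1] range as an illustration of the idea.

import numpy as np

def min_max_normalize(x):
    """Min-max scaling to [0, 1]; assumed here as the normalization
    referred to as equation (7), which is not reproduced in this extract."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Example: bring Amount values (rupees) onto a comparable [0, 1] scale.
print(min_max_normalize([0.0, 88.35, 250.12, 25691.16]))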
V. Experimental
In the current situation, more and more people use debit/credit cards in their day-to-day activities. Consequently, trust and safety are major concerns for card use on all platforms, and fraud detection and prevention are needed at every step of the credit/debit card user journey. From the dataset described above it can be inferred that there are 28 anonymized features and non-anonymized features such as Amount and Class, where Class indicates whether the transaction was a fraud or not. The statistical description and summary are shown in Table 1, which gives some preliminary statistical measurements.
Descriptive Statistics    Time ()       Confidentiality ()    Amount ()     Class ()
Minimum                   0             -7.084                0             0
Maximum                   172792        0.802                 25691.160     1
Mean                      94813.860     0                     88.350        0.002
Standard Deviation        47488.150     0.198                 250.120       0.042
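Summary statistics of this kind can be reproduced with a few lines of pandas; the file name and column names below are assumptions based on the dataset description in Table 2.

import pandas as pd

# Descriptive statistics (min, max, mean, std) for selected columns;
# file and column names are assumptions, not taken from the paper's setup.
df = pd.read_csv("creditcard.csv")
summary = df[["Time", "Amount", "Class"]].agg(["min", "max", "mean", "std"])
print(summary.round(3))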
Table 4b. Multiple linear regression approach and its accuracy parameters
Name of the Parameters    Time taken (Seconds)    Correlation Coefficient    Mean Absolute Error    Root Mean Squared Error    Relative Absolute Error (%)    Root Relative Squared Error (%)
Time ()                   0.36                    0.16                       41912.04               46887.85                   97.93                          98.74
Confidentiality ()        0.22                    0.42                       0.12                   0.1797                     93.87                          90.77
Amount ()                 0.21                    0.25                       102.25                 242.07                     98.76                          96.78
Class ()                  0.21                    0.33                       0.01                   0.04                       291.54                         94.46
Table 5. Accuracy measures for the credit card fraud detection data using normalization
Decision Tree Classifier    Size of the Tree    Time taken to build model (Seconds)    Correlation Coefficient    Mean Absolute Error    Root Mean Squared Error    Relative Absolute Error (%)    Root Relative Squared Error (%)
M5                          240                 12.05                                  0.5744                     0.0023                 0.0340                     66.5190                        81.8706
Random Forest               100                 171.03                                 0.6246                     0.0019                 0.0325                     56.5373                        78.3061
Random Tree                 1137                3.23                                   0.4438                     0.0019                 0.0434                     54.6672                        104.5627
REP Tree                    77                  3.19                                   0.5585                     0.0022                 0.0346                     64.1355                        83.3015
Fig. 7 Decision tree comparison using tree size to build the tree
As described with Table 2, the dataset [20] contains 492 frauds out of 284,807 transactions. In this case, fraudulent transactions occur in 0.1954% of cases and genuine transactions in 99.8040%. Based on the descriptive statistics, the mean is the average of the observed credit card transaction samples. Over the entire dataset, the average transaction amount is Rs. 88.35 and the standard deviation, which measures the amount of variation or dispersion of the amount, is Rs. 250.12. The maximum credit card transaction amount is Rs. 25691.16 and the minimum is Rs. 0; the numerical illustrations are shown in Table 3, Fig. 3 and Fig. 4 based on the time and amount of transactions. Clustering analysis for credit card fraud detection uses four parameters, namely time, confidentiality, amount and class; the corresponding clustering analysis is shown in Fig. 4.
Multiple linear regression is used to predict the future with the symbolic notation of equation (1). In this case, the dataset has four types of parameters, namely time (), confidentiality (), amount () and class (). Multiple linear regression was used to find the forecasting equations for time, confidentiality, amount and class; the corresponding equations are shown in Table 4. Fig. 5 shows the plot matrix, which is used to understand the clustering pattern for the various combinations of the parameters time (), confidentiality (), amount () and class ().
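The experiments reported here were produced with WEKA; as a rough illustration only, the following scikit-learn sketch fits a multiple linear regression that predicts one parameter (Amount) from the remaining attributes and reports a correlation coefficient and mean absolute error in the spirit of Table 4. The file and column names are assumptions, not the paper's actual configuration.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Sketch: regress Amount on the remaining attributes (assumed file/columns).
df = pd.read_csv("creditcard.csv")
X = df.drop(columns=["Amount"])
y = df["Amount"]

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# Report metrics comparable in spirit to the accuracy parameters of Table 4.
print("correlation coefficient:", np.corrcoef(y, pred)[0, 1])
print("mean absolute error    :", mean_absolute_error(y, pred))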
The regression results involve data on different scales, namely seconds, percentages and error terms, as shown in Table 4. Such differently scaled data are not directly suitable for numerical comparison, so the various scales were converted to a similar scale using the familiar normalization equation (7); the corresponding normalized values are presented in Table 5 and Fig. 6. Normalization techniques are essential for researchers in data mining and data analytics and are very helpful for writing the necessary interpretation.
References