
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/352196647

Data Quality Analysis based Machine Learning models for Credit Card Fraud Detection

Article · June 2021
DOI: 10.51201/JUSST/21/05263

2 authors: Amit Pundir (Shobhit University) and Rajesh Pandey (Shobhit University, Meerut, India)

All content following this page was uploaded by Rajesh Pandey on 20 November 2021.


Journal of University of Shanghai for Science and Technology ISSN: 1007-6735

Data Quality Analysis based Machine Learning models for Credit Card Fraud Detection

1 Amit Pundir, 2 Rajesh Pandey
1 Research Scholar, Shobhit Institute of Engineering & Technology, Meerut, (Uttar Pradesh), India
2 Assistant Professor, Shobhit Institute of Engineering & Technology, Meerut, (Uttar Pradesh), India
[email protected], [email protected]

Abstract

Financial fraud is a growing problem in the monetary business with far-reaching consequences, and although many detection processes have been proposed, the problem persists. Data quality management combined with data mining has been applied effectively to automate the analysis of massive amounts of complex information, and data mining has played a notable role in identifying credit card fraud in online transactions. Credit card fraud detection is a data quality management problem within data mining, and it is challenging for two important reasons: first, the profiles of normal and fraudulent behavior change frequently, and second, credit card fraud data are highly skewed. This paper examines the performance of Decision Tree, Logistic Regression, and Random Forest strategies on highly skewed credit card fraud data. The credit card transaction dataset, containing 284,807 transactions, is sourced from Kaggle, a publicly accessible dataset repository. The methods are applied both to raw data values and to data prepared with preprocessing techniques. Performance is assessed using accuracy, sensitivity, specificity, precision, and recall. The results indicate accuracies of 90.8%, 98.5%, and 99.1% for the Decision Tree, Logistic Regression, and Random Forest classifiers respectively.

Keywords

Fraud in credit card, data mining, logistic regression, decision tree, random forest, comparative analysis

1. Introduction

Financial fraud is a growing concern with widespread consequences for public authorities, corporate organizations, and the currency industry. Today's heavy reliance on web technology has driven a surge in credit card transactions, but credit card fraud and the misuse of charge cards have grown alongside both online and offline transactions. As credit card payment becomes an unavoidable method of settlement, attention has turned to recent computational methodologies for handling the credit

Volume 23, Issue 6, June - 2021 Page -318



card fraud problem. Many fraud detection mechanisms and software-based solutions are used in organizations to prevent fraud in areas such as credit cards, retail, online trade, e-commerce, insurance, and enterprise systems. Data quality management under data mining is a well-known technique for addressing credit card fraud detection problems. It is impossible to be completely certain about the true intention and legitimacy behind an application or transaction; in practice, searching the available data for possible evidence of fraud using mathematical algorithms is the most effective option. Credit card fraud detection is the process of separating transactions into two classes, legitimate and fraudulent. Several techniques have been designed and implemented for this task, including genetic algorithms, artificial neural networks, frequent itemset mining, machine learning algorithms, and the migrating birds optimization algorithm; in this paper, a comparative analysis of logistic regression, SVM, decision tree, and random forest is carried out. Credit card fraud detection is a popular but difficult problem to solve. First, the limited amount of available data makes it challenging to match patterns in the dataset. Second, many entries in a dataset can be transactions by fraudsters that nevertheless fit a pattern of legitimate behavior. The problem also has several practical constraints. First, datasets are rarely publicly accessible, and research results are often hidden or censored, which makes benchmarking of the models built very difficult; previous research using real data is scarcely documented in the literature. Second, progress is slowed by security concerns that limit the exchange of ideas and methods in fraud detection, and especially in credit card fraud detection. Finally, the datasets evolve continuously, so the profiles of normal and fraudulent behavior keep changing: a transaction that was legitimate in the past may be fraudulent in the present, or vice versa. This paper evaluates four advanced data mining approaches, decision trees, support vector machines, logistic regression, and random forests, and then makes a comparative evaluation of which model performs best.

Credit card transaction datasets are rarely available, and they are highly imbalanced and skewed. Selecting optimal features (variables) for the models and choosing a suitable evaluation metric are among the most important parts of data mining for assessing the performance of techniques on skewed credit card fraud data. A number of challenges are associated with credit card fraud detection: the fraudulent behavior profile is dynamic, fraudulent transactions tend to look like legitimate ones, and detection performance is greatly affected by the sampling approach used, the variables selected, and the detection technique applied. At the end of this paper, conclusions about the results of the classifier evaluations are drawn and compared.


From the experiments, Logistic Regression achieved an accuracy of 97.7%, SVM 97.5%, and Decision Tree 95.5%, while the best result was obtained by Random Forest with an accuracy of 98.6%. The results therefore indicate that Random Forest achieves the highest accuracy, 98.6%, on the credit card fraud detection problem with the dataset provided by the ULB Machine Learning Group.

2. Literature Review

Common algorithms and methods used in related work are presented in this section. Table 1 briefly summarizes the methods used in these studies.

(Akila & Reddy, 2018) present an ensemble model named the Risk Induced Bayesian Inference Bagging model (RIBIB). They propose a three-step approach: a bagging architecture with a constrained bag creation method, the Risk Induced Bayesian Inference method as base learner, and a weighted voting combiner. Bagging is the process of combining multiple training datasets and using them separately to train multiple classifier models. They evaluated their solution on Brazilian bank data and excelled at cost minimization compared to other state-of-the-art models.

(de Sá et al., 2018) propose a customized Bayesian Network Classifier (BNC) that is created automatically by an algorithm named the Hyper-Heuristic Evolutionary Algorithm (HHEA). HHEA builds a custom BNC algorithm, assembling an optimal combination of the modules needed for the dataset at hand. The dataset they used comes from UOL PagSeguro, an online payment service in Brazil. They evaluated their results using the F1 score and a measure they call economic efficiency, which captures the company's economic loss from fraud. They also use an approach called instance reweighing when comparing their results with other baselines: false negatives (transactions predicted as legitimate but actually fraudulent) are assigned more importance, as they matter more to payment companies than false positives.

(Carcillo et al., 2019) combine supervised and unsupervised techniques and present a hybrid approach to fraud detection. The reason for using unsupervised learning is that fraudulent behavior changes over time, and the learner needs to track these fraud patterns. They specify an approach for calculating outlier scores at different granularity levels. One is global granularity, in which all transaction samples are considered as one global distribution. Another is local granularity, where outlier scores are computed independently for each credit card. Lastly, cluster granularity sits between global and local granularity, and takes customer behavior into account, such as the amount of money spent in the last 24 hours. To implement their classifier model, they used the Balanced Random Forest (BRF) algorithm. Top-n precision and AUC-PR (area under the precision-recall curve) are used for evaluation.


Author / Year | Algorithm | Real dataset | Derived features | Evaluation metrics
(Akila & Reddy, 2018) | Risk Induced Bayesian Inference Bagging | | | Cost based, FPR, FNR, TNR, TPR, Recall, AUC
(de Sá et al., 2018) | HHEA-based Bayesian Network Classifier | ✔ | | F1, economic loss
(Nami & Shajari, 2018) | Dynamic Random Forest (DRF) with minimum risk model, K-Nearest Neighbor (KNN) | ✔ | ✔ | Recall, Precision, F1-measure, Specificity, Accuracy
(Deshe et al., 2018) | Decision tree, Naïve Bayes, Logistic regression, Random Forest, Artificial Neural Network | ✔ | | False negative rate, false positive rate
(Carcillo et al., 2019) | K-means, Balanced Random Forest | ✔ | | AUC-PR, Precision
(Kim et al., 2019) | Logistic regression, decision trees, recurrent neural networks, convolutional neural networks | ✔ | | K-S statistics, AUROC, alert rate, precision, recall, cost reduction rate
(Dornadula & Geetha, 2019) | Decision Trees, Naïve Bayes classification, Least Squares Regression, Support Vector Machines (SVMs) | ✔ | ✔ | Accuracy, Precision, Matthews Correlation Coefficient (MCC)
(Saia et al., 2019) | Prudential Multiple Consensus model: ensemble of Multi-layer perceptron, Gaussian Naïve Bayes, Adaptive Boosting, Gradient Boosting, Random Forest | ✔ | | Sensitivity, Specificity, AUC, Miss rate, Fallout
(Askari & Hussain, 2020) | Decision tree based intuitionistic fuzzy logic | ✔ | ✔ | Sensitivity, specificity, false negative rate, false positive rate, precision, accuracy, F-measure
(Lucas et al., 2020) | Random Forest classifier with derived features created from a Hidden Markov Model (HMM) | ✔ | ✔ | Precision-Recall AUC
(Misra et al., 2020) | Dimension reduction using deep autoencoders; Multi-Layer Perceptron, K-Nearest Neighbor, Logistic Regression | ✔ | | F1-score

Table 1: Examining various works for fraud detection

(Kim et al., 2019) compare two different approaches to fraud detection, hybrid ensemble methods and deep learning, within a framework named champion-challenger analysis. The champion model is a long-used model based on various machine learning classifiers such as decision trees, logistic regression, and simple neural networks; each method is trained with different samples and features, and the best outcome is chosen manually by experts (Kim et al., 2019). The challenger framework, in contrast, uses recent deep learning architectures consisting of convolutional neural networks, recurrent neural networks, and their variants. It tries each modern deep learning architecture, specifies activation functions, dropout rates, and costs, then searches for the best hyperparameters. During training it applies early stopping, which halts training when no further improvement is achieved on the validation set. Finally, it selects the best-performing model, saves the hyperparameters and complexity settings used, and tries to find better candidates from the previous experiments (Kim et al., 2019).

Various evaluation metrics are used to compare the two frameworks. K-S statistic: the maximum difference between two distributions. AUROC: area under the receiver operating characteristic curve, a plot of the true positive rate against the false positive rate. Alert rate: given a user-defined cut-off value, the fraction of all transactions that are alerted. Precision: fraudulent transactions correctly predicted (TP) over all alerted transactions (TP + FP). Recall: fraudulent transactions correctly predicted (TP) over all fraudulent transactions (TP + FN). Cost reduction rate: each missed fraud (FN) costs the owning company the transaction amount, so this metric is calculated from the sum of the amounts of the FNs. As a result, the


challenger framework which is based on deep learning performs much better than the champion
framework.
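The precision, recall, and alert-rate definitions above can be sketched directly from confusion-matrix counts; the counts below are illustrative, not taken from any of the cited studies.

```python
# Evaluation-metric sketch from raw confusion-matrix counts.
def precision(tp, fp):
    # fraud predicted as fraud, over all alerted transactions
    return tp / (tp + fp)

def recall(tp, fn):
    # fraud predicted as fraud, over all actually fraudulent transactions
    return tp / (tp + fn)

def alert_rate(alerted, total):
    # alerted transactions over all transactions
    return alerted / total

tp, fp, fn = 80, 20, 40          # made-up counts for illustration
total_transactions = 10_000

print(precision(tp, fp))                          # 0.8
print(recall(tp, fn))                             # ~0.667
print(alert_rate(tp + fp, total_transactions))    # 0.01
```

With only 100 alerts out of 10,000 transactions, the alert rate stays low even though recall is modest, which is exactly the trade-off the champion-challenger comparison measures.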

(Askari & Hussain, 2020) developed a decision tree using intuitionistic fuzzy logic. They argue that it recognizes the conceptual properties of transaction attributes, so that legitimate transactions are not flagged as fraud, or vice versa. The motivation for using fuzzy logic is that it has not been tried as much as other artificial intelligence methods on e-transaction fraud detection.

The C4.5 algorithm is combined with fuzzy logic and intuitionistic fuzzy logic, and the final algorithm is named IFDTC4.5. The fuzzy tests are defined over different attributes, and the information gain ratio is calculated using membership and non-membership degrees. That information is then used to create an intuitionistic fuzzy logic algorithm that classifies transactions as fraudulent, normal, or doubtful. To evaluate the final model, almost all suitable metrics are used: sensitivity, specificity, false negative rate, false positive rate, precision, accuracy, and F-measure. They show that the proposed method outperforms existing techniques, and they argue that the algorithm also works more efficiently and faster than others (Askari & Hussain, 2020).

(Pourhabibi et al., 2020) review graph-based anomaly detection between the years 2007 and 2018. They state that the general approach is to do the right feature engineering and embed the graph into a feature space so that machine learning models can be built. They also note that graph-based anomaly detection techniques have been on the rise since 2017 (Pourhabibi et al., 2020).

As might be expected, credit card transactions are not isolated, independent events; rather, they form a sequence of transactions. (Lucas et al., 2020) take this property into account and build a Hidden Markov Model (HMM) to relate a current transaction to its previous transactions, extract derived features, and use those features in a Random Forest classifier for fraud detection. The features created by the HMM quantify how similar a sequence is to a cardholder's or terminal's past sequences. They evaluate the final model using the Precision-Recall AUC metric and show that feature engineering with the HMM yields an acceptable rise in the PR-AUC score.

(Misra et al., 2020) propose a two-stage model for credit card fraud detection. First, an autoencoder reduces the dimensionality, transforming the transaction attributes into a lower-dimensional feature vector. Then the resulting feature vector is fed to a supervised classifier as input. An autoencoder is a type of feed-forward neural network. It usually has the same input and output dimensions, but with a reconstruction phase in between: an encoder transforms the input to a lower dimension, and then a decoder tries to reconstruct, from the encoder's output, an output layer with the same dimension as the input layer. In this work, only the encoder part of the autoencoder is used. The output of the encoder is then fed as input to a number of classifiers: multi-layer perceptron, k-nearest neighbors, and logistic regression. The F1 score is used to evaluate the final classifier, which outperforms similar methods in terms of F1 (Misra et al., 2020).
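The encoder half described above can be sketched as a single linear layer with a nonlinearity; the weights below are random placeholders (not trained by reconstruction) and the dimensions are made up, so this only illustrates the shape of the transformation.

```python
import numpy as np

# Encoder sketch: map a transaction's attribute vector to a lower-dimensional
# latent feature vector, as done before the downstream classifier.
rng = np.random.default_rng(0)
n_features, n_latent = 30, 8            # e.g. 30 raw attributes -> 8 latent

W = rng.normal(0, 0.1, size=(n_latent, n_features))  # untrained placeholder weights
b = np.zeros(n_latent)

def encode(x):
    # encoder output = nonlinearity(linear projection); tanh keeps values in (-1, 1)
    return np.tanh(W @ x + b)

x = rng.normal(size=n_features)         # one synthetic transaction
z = encode(x)
print(z.shape)                          # (8,) latent vector fed to the classifier
```

In the real model the weights come from training the full autoencoder on reconstruction loss; only the encoder is kept at inference time.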

(Dornadula & Geetha, 2019) observe that card transactions are frequently dissimilar to past transactions made by the same cardholder. They therefore first group cardholders by transaction amount into high, medium, and low range partitions, and then extract extra features from these groups using the sliding-window method. Next, SMOTE (Synthetic Minority Over-Sampling Technique) is applied to the dataset to address the class imbalance problem. Precision and MCC (Matthews Correlation Coefficient) are used to evaluate the models. Among the various classifiers, logistic regression, decision tree, and random forest perform well on these evaluation metrics (Dornadula & Geetha, 2019).
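The SMOTE idea can be sketched in a few lines: each synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority neighbors. This is a simplified stand-in for the published algorithm (not the imbalanced-learn implementation), and the minority points are made up.

```python
import numpy as np

# Minimal SMOTE-style oversampling sketch.
def smote_sketch(X_minority, n_synthetic, k=3, rng=None):
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # distances to all minority samples; skip the sample itself (index 0 of argsort)
        d = np.linalg.norm(X_minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        neighbor = X_minority[rng.choice(neighbors)]
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)

X_fraud = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_fraud, n_synthetic=6, rng=42)
print(X_new.shape)    # (6, 2): six synthetic fraud samples
```

Because every synthetic point is an interpolation between two real minority points, the oversampled class stays inside the region the real frauds occupy.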

(Saia et al., 2019) consolidate state-of-the-art classification algorithms in a model called Prudential Multiple Consensus. The idea builds on the fact that different classifiers do not agree on certain transactions. The algorithm consists of two steps:

1) A transaction is perceived as legitimate if and only if the current algorithm classifies it as legitimate and the classification probability is above the average over all algorithms; otherwise, it is perceived as fraudulent.
2) After all algorithms run the first step, majority voting is applied and the final decision is made.

Sensitivity, fallout, and AUC are used to evaluate the model over combinations of algorithms such as multi-layer perceptron, Gaussian Naïve Bayes, Adaptive Boosting, Gradient Boosting, and Random Forest. It performs well in terms of sensitivity and AUC.
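The two steps can be sketched for a single transaction as follows; the per-classifier probabilities are made up for illustration, and the function name `pmc_vote` is ours, not from the paper.

```python
# Sketch of the two-step Prudential Multiple Consensus rule.
def pmc_vote(probs_legit):
    """probs_legit: each classifier's probability that the transaction is legitimate."""
    avg = sum(probs_legit) / len(probs_legit)
    # Step 1: a classifier's decision is "legit" only if its own probability
    # exceeds the average over all classifiers; otherwise "fraud".
    votes = ["legit" if p > avg else "fraud" for p in probs_legit]
    # Step 2: majority voting over the per-classifier decisions.
    return max(set(votes), key=votes.count)

# Two of three classifiers fall below the average, so the consensus is "fraud".
print(pmc_vote([0.9, 0.4, 0.3]))
```

The prudential part is Step 1: a merely above-chance "legitimate" score is not enough; a classifier must be more confident than the ensemble average before its vote counts as legitimate.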

(Nami & Shajari, 2018) present a two-stage solution to the problem. Before starting the algorithm steps, they derive some extra features to acquire an enhanced understanding of cardholders' spending behavior. Then, in the first stage, reasoning that new attitudes of cardholders will be closer to their recent attitudes, a new similarity measure is constructed based on transaction time; it naturally assigns more weight to recent transactions. The second stage consists of training a Dynamic Random Forest algorithm with a minimum risk model, which decides the outcome of a transaction with a cost-sensitive approach. (Nami & Shajari, 2018) tested their model using metrics such as recall, precision, F-measure, specificity, and accuracy, and showed that the minimum risk approach increased performance.

(Deshe et al., 2018) combine machine learning algorithms with customer incentives. They argue that there
must be secondary verification in order to achieve more accurate results. The secondary verification could


be applied to certain transactions that are higher than a threshold value. They specify the strategies and their conditions according to the benefits they offer to retailers, card issuers, and consumers, aiming for a "win-win-win" outcome. The existing strategies are generally no prevention (doing nothing) and applying machine learning techniques to all transactions; the third strategy is to perform a second verification with customer incentives. They experiment with the different strategies using Decision Tree, Naïve Bayes, Logistic Regression, Random Forest, and Artificial Neural Network algorithms.

3. Data Quality Management Common Challenges

The credit card fraud detection problem shares some common challenges that must be considered when implementing efficient machine learning algorithms. They can be grouped as overcoming the imbalanced dataset problem, doing the right feature engineering, and executing models in real-time scenarios.

3.1. Imbalanced Dataset Problem

Almost all datasets from banks and other organizations contain millions of transactions, and all of them share a common problem for state-of-the-art machine learning algorithms: the imbalanced dataset. The problem arises from the fact that the proportion of actual fraud transactions among all transactions is tiny. The number of legitimate transactions per day completed by Tier-1 issuers in 2017 was 5.7 million, whereas fraud transactions in the same category numbered 1,150 (Ryman-Tubb et al., 2018). This unbalanced data distribution lessens the effectiveness of machine learning models (Japkowicz & Stephen, 2002). Hence, training models to detect such rare fraudulent transactions requires extra caution and thought. The generally known approaches to the imbalanced dataset problem fall into two categories: sampling methods and cost-based methods (Dal Pozzolo et al., 2017). We examine how state-of-the-art research tackles this problem.

(Fiore et al., 2019) tackle the problem by increasing the number of "interesting but underrepresented" instances in the training set. They achieve this by generating credible examples with Generative Adversarial Networks (GANs) that mimic the "interesting" class examples as closely as possible. In terms of sensitivity, the classifier trained with the help of GANs gives good results compared to the original classifier.

(Rtayli & Enneya, 2020) state that fraudulent transactions are a very small portion of total transactions, which gives rise to the imbalanced dataset problem. To address it, they use a Random Forest classifier to select only the relevant features. They apply this approach in the area of credit risk identification, where it gives accurate results on the metrics they use; the approach could also be used in credit card fraud detection.

Volume 23, Issue 6, June - 2021 Page -325


Journal of University of Shanghai for Science and Technology ISSN: 1007-6735

(Zeager et al., 2017) state that the common approaches to overcoming class imbalance are oversampling the minority class (fraudulent transactions), undersampling the majority class (legitimate transactions), and cost-sensitive cost functions. They utilize an oversampling approach named SMOTE (synthetic minority oversampling technique), which generates synthetic examples of fraudulent transactions.

(Jurgovsky et al., 2018), on the other hand, take a different approach and employ undersampling at the account level to overcome class imbalance. Specifically, they tag accounts that contain at least one fraudulent transaction as "compromised" and accounts with no fraudulent transactions as "genuine". With probability 0.9 they randomly pick a genuine account, and with probability 0.1 a compromised one. The process is repeated to create the training set.
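The account-level sampling just described can be sketched as below; the account lists and the helper name `draw_accounts` are illustrative, not from the paper.

```python
import random

# Account-level undersampling sketch: each draw picks a genuine account
# with probability 0.9 and a compromised account with probability 0.1.
def draw_accounts(genuine, compromised, n, seed=0):
    rng = random.Random(seed)
    picks = []
    for _ in range(n):
        pool = genuine if rng.random() < 0.9 else compromised
        picks.append(rng.choice(pool))
    return picks

genuine = ["acct_%d" % i for i in range(100)]      # no fraudulent transactions
compromised = ["bad_%d" % i for i in range(5)]     # >= 1 fraudulent transaction
sample = draw_accounts(genuine, compromised, n=1000)
print(sum(a.startswith("bad") for a in sample) / 1000)   # roughly 0.1
```

Even though compromised accounts are a tiny minority of all accounts, they make up about 10% of the training draws, which is the point of the scheme.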

(Zhu et al., 2020) suggest an approach called the Weighted Extreme Learning Machine (WELM) to solve the imbalanced dataset problem. WELM is a variant of the ELM adapted to imbalanced datasets by assigning different weights to different types of samples.

3.2. Feature Engineering Challenge

The raw transaction information extracted from an organization's database is quite limited: the cardholder's balance, transaction time, credit limit, and transaction amount are some of the available fields. When only these ready-to-use features are used to train common machine learning algorithms, performance is unlikely to vary much among them. To create a difference, accurate feature engineering becomes a must. We review some research that handles feature engineering in credit card fraud detection scenarios.

(Zhang et al., 2019) propose a feature engineering method based on homogeneity-oriented behavior analysis, arguing that behavior analysis should be performed on distinct groups of transactions that share the same transaction characteristics. These characteristics can be extracted from information about time, geographic space, transaction amount, and transaction frequency. For each characteristic found, two feature engineering strategies are applied: transaction aggregation and a rule-based strategy.

(Roy et al., 2018) extend the baseline features by adding the following to their model: the frequency of transactions per month; dummy variables filling in missing data; maximum and mean authorization amounts over the 8-month period; new variables indicating when a transaction is made at a predefined type of location such as a restaurant or gas station; and a new variable indicating whether a transaction amount at a given retailer is greater than 10% of the standard deviation of the mean of legitimate transactions for that retailer.
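A retailer-level derived feature in this spirit can be sketched as follows. The threshold used here, "more than the retailer's mean plus one standard deviation", and the transaction data are our illustrative assumptions, not the authors' exact definition.

```python
import statistics
from collections import defaultdict

# Flag transactions whose amount is unusually large for their retailer.
transactions = [
    ("grocer", 20.0), ("grocer", 25.0), ("grocer", 22.0), ("grocer", 90.0),
    ("gas", 40.0), ("gas", 42.0), ("gas", 41.0),
]

by_retailer = defaultdict(list)
for retailer, amount in transactions:
    by_retailer[retailer].append(amount)

# per-retailer (mean, population std dev) of observed amounts
stats = {r: (statistics.mean(a), statistics.pstdev(a)) for r, a in by_retailer.items()}

flags = [amount > stats[r][0] + stats[r][1] for r, amount in transactions]
print(flags)   # the 90.0 grocer transaction is clearly flagged
```

With only a handful of transactions per retailer the threshold is noisy (the 42.0 gas purchase also trips it); in practice the statistics would be computed over a retailer's full history of legitimate transactions.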

(Chouiekh & Haj, 2018) utilize convolutional neural networks for fraud detection analysis and argue that since deep learning algorithms use deep architectures internally, they extract their features automatically and hierarchically, with layers building from the bottom up. In this way, a time- and resource-consuming feature engineering process is avoided (Chouiekh & Haj, 2018).

(Wu et al., 2019) focus on credit cards rather than transactions for feature engineering, mainly addressing the credit card cash-out problem, a fraud technique that spends the entire limit on the credit card. The study incorporates additional features into the model drawn from industry experts, tips shared by fraudsters online, reports, and news. The study reaches a total of 521 features, creating a pool for feature selection studies. The classifier model built on these feature sets increases precision by 4.6%-8.1%.

3.3. Real Time Working Problems

Since the incoming transactions processed by the system every day are numerous, and the behaviors of cardholders and fraudsters can change rapidly, the classification models must be regenerated frequently. This raises the question of how efficient the created models are. We review some research that tries to implement efficient systems that work in real time.

(Carcillo et al., 2018) utilize open-source big data tools such as Apache Spark, Kafka, and Cassandra to create a real-time fraud detector called Scalable Real-time Fraud Finder (SCARFF). They emphasize that the system has been tested extensively for scalability, sensitivity, and accuracy, and that it processes 200 transactions per second, which they argue is far more than their partner's rate of 2.4 transactions per second (Carcillo et al., 2018).

(Patil et al., 2018) propose a real-time credit card fraud detection system that analyzes incoming transactions. It uses a Hadoop network to encode data in HDFS format, and a SAS system converts the files to raw data. The raw data is transferred to the analytical model in order to build the data model. That cycle helps the system learn the model by itself in a scalable, real-time manner.

4. Proposed Methodology

The proposed techniques are used in this paper for detecting frauds in a credit card system. Comparisons are made between different machine learning algorithms, namely Logistic Regression, Decision Trees, and Random Forest, to determine which algorithm suits best and can be adopted by credit card merchants for identifying fraudulent transactions. Figure 1 shows the architectural diagram representing the overall system framework.

The processing steps used to find the best algorithm for the given dataset are listed below.


Algorithm:

Step 1: Read the dataset.

Step 2: Apply random sampling to the dataset to make it balanced.

Step 3: Divide the dataset into two parts, a training set and a test set.

Step 4: Apply feature selection for the proposed models.

Step 5: Calculate accuracy and performance metrics to measure the efficiency of the different algorithms.

Step 6: Select the best algorithm for the given dataset based on efficiency.
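The six steps above can be sketched end to end on a toy dataset; the synthetic data and the trivial threshold "models" below are illustrative stand-ins for the real dataset and the real algorithms.

```python
import random

# Steps 1-2: a synthetic, deliberately balanced dataset (200 per class).
rng = random.Random(0)
data = [(rng.gauss(1.0, 1.0), 0) for _ in range(200)] + \
       [(rng.gauss(4.0, 1.0), 1) for _ in range(200)]
rng.shuffle(data)

# Step 3: split into train and test sets.
train, test = data[:300], data[300:]

# Step 5: accuracy of a threshold classifier "predict 1 if x > threshold".
def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

# Stand-ins for the candidate algorithms being compared.
candidates = [1.0, 2.5, 4.0]

# Step 6: keep the candidate that performs best, then report test accuracy.
best = max(candidates, key=lambda t: accuracy(t, train))
print(best, round(accuracy(best, test), 2))
```

Selecting on the training set and reporting on the held-out test set mirrors the train/test discipline of Steps 3, 5, and 6; with class centers at 1.0 and 4.0, the midpoint threshold wins.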

4.1. Logistic Regression

Logistic regression is a supervised classification method that returns the probability of a binary dependent variable predicted from the independent variables of the dataset; that is, logistic regression predicts the probability of an outcome that has two values, such as zero or one, yes or no, or false or true. Logistic regression is similar to linear regression, but where linear regression fits a straight line, logistic regression fits a curve. Prediction is based on one or several predictors (independent variables); logistic regression produces a logistic curve that plots values between zero and one.

Logistic regression is a regression model in which the dependent variable is categorical, and it analyzes the relationship between multiple independent variables. There are several types of logistic regression model, such as the binary logistic model, the multiple logistic model, and binomial logistic models. The binary logistic regression model estimates the probability of a binary response based on one or more predictors.

P(Y = 1 | X) = 1 / (1 + e^-(b0 + b1*x1 + b2*x2 + ... + bn*xn))


The equation above represents logistic regression in mathematical form.

Figure 1: Logistic curve

This graph shows the difference between linear regression and logistic regression: logistic regression produces a curve, whereas linear regression produces a straight line.
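As a minimal numeric illustration of the logistic form, the coefficients below are made up, not fitted to any dataset.

```python
import math

# Logistic regression scoring sketch: a linear combination of features is
# passed through the sigmoid, yielding a probability in (0, 1).
def predict_proba(x, coef, intercept):
    z = intercept + sum(c * xi for c, xi in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-z))    # sigmoid squashes z into (0, 1)

coef, intercept = [0.8, -1.2], -0.5      # illustrative coefficients
p = predict_proba([2.0, 1.0], coef, intercept)
print(round(p, 3))                        # ~0.475: probability of the positive class
print("fraud" if p >= 0.5 else "legit")   # thresholding turns probability into a label
```

The 0.5 threshold is the conventional default; in fraud detection it is often lowered to trade precision for recall.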

4.2. Decision Tree

A decision tree is an algorithm that uses a tree-like graph or model of decisions and their possible outcomes to predict a final decision; the algorithm uses conditional control statements. A decision tree is an algorithm for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. These algorithms are very popular for inductive learning and have been successfully applied to a broad range of tasks. To label a new transaction (legitimate or fraud) whose class label is unknown, the transaction's values are tested against the decision tree, and a path is traced from the root node to an output/class label for that transaction.

Decision rules determine the outcome at the content of a leaf node. In general, rules have the form 'if condition 1 and condition 2 but not condition 3, then outcome'. A decision tree helps determine the worst, best, and expected values for different scenarios, is simple to understand and interpret, and allows the addition of new possible scenarios.

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

The steps for building a decision tree are as follows: first, calculate the entropy of every attribute using
the dataset for the problem; second, divide the dataset into subsets using the attribute for which
information gain is maximum (equivalently, for which entropy is minimum); third, create a decision tree node
containing that attribute; finally, recurse on the subsets using the remaining attributes.
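The entropy and gain computations in the steps above can be sketched as follows (a toy example; the attribute values and labels are hypothetical):

```python
import math
from collections import Counter

# Entropy of a label list: -sum p_i * log2(p_i).
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Information gain of splitting `rows` (list of (attribute_value, label)) on the attribute.
def information_gain(rows):
    labels = [lbl for _, lbl in rows]
    total = entropy(labels)
    by_value = {}
    for v, lbl in rows:
        by_value.setdefault(v, []).append(lbl)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(subset) / len(rows) * entropy(subset)
                    for subset in by_value.values())
    return total - remainder

# Toy data: the attribute perfectly separates the classes, so gain equals entropy(S).
rows = [("a", "fraud"), ("a", "fraud"), ("b", "legit"), ("b", "legit")]
print(entropy([lbl for _, lbl in rows]))  # 1.0 for a 50/50 split
print(information_gain(rows))             # 1.0 (perfect split)
```

The attribute with the largest gain becomes the node at each recursion step.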

Figure 2: Decision Tree

4.3. Random Forest

Random forest is an algorithm for classification and regression. In short, it is an ensemble of decision
tree classifiers. Random forest has an advantage over a single decision tree in that it corrects the
tendency to overfit the training set. Each individual tree is trained on a randomly sampled subset of the
training set, and each node splits on a feature selected from a random subset of the full feature set.
Training is extremely fast in a random forest, even for large data sets with many features and data
instances, because each tree is trained independently of the others. The random forest algorithm has been
found to provide a good estimate of the generalization error and to be resistant to overfitting.

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

Where $P(c \mid x)$ represents the posterior probability, $P(x \mid c)$ represents the likelihood, $P(c)$ represents the class prior probability, and $P(x)$ represents the predictor prior probability.

$$P(c \mid X) = P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)$$


Random forest also ranks the importance of variables in a regression or classification problem in a
natural way.
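The bagging-and-voting mechanics described above can be sketched in a few lines. This is a deliberately tiny stand-in: each "tree" is a one-threshold stump rather than a full decision tree, and the data is synthetic, so it illustrates only the bootstrap sampling and majority voting:

```python
import random
from collections import Counter

random.seed(0)

# Toy training set: (feature, label) pairs with a clean decision boundary.
train = [(x, "fraud" if x > 5 else "legit") for x in range(10)]

def fit_stump(sample):
    # A stump "tree": thresholds at the mean feature of its bootstrap sample.
    t = sum(x for x, _ in sample) / len(sample)
    return lambda x: "fraud" if x > t else "legit"

# Bagging: each tree sees an independent bootstrap sample (drawn with replacement).
trees = [fit_stump(random.choices(train, k=len(train))) for _ in range(25)]

def forest_predict(x):
    # The forest's prediction is the majority vote over all trees.
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

print(forest_predict(9))  # well above every stump threshold -> "fraud"
print(forest_predict(0))  # well below -> "legit"
```

Because each tree trains on its own sample, the trees are decorrelated, which is what reduces overfitting relative to a single tree.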

5. Experiments & Results

First, the credit card dataset is taken from the source, and cleaning and validation are performed on it,
including removal of redundancy, filling empty values in columns, and converting the necessary variables
into factors (classes). The data is then split into two parts: a training dataset and a test dataset. Next,
k-fold cross-validation is performed: the original sample is randomly partitioned into k equal-sized
subsamples; of the k subsamples, a single subsample is retained as the validation data for testing the
model, and the remaining k − 1 subsamples are used as training data. Models are created for logistic
regression, decision tree, SVM, and random forest, and then accuracy, sensitivity, specificity, and
precision are calculated and compared.
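The k-fold partitioning step can be sketched with indices only (a minimal illustration of the mechanics, not the authors' code):

```python
def k_fold_splits(n, k):
    """Yield (training_indices, validation_indices) for each of k folds over n samples."""
    # Partition indices 0..n-1 into k roughly equal folds.
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, validation in enumerate(folds):
        # The remaining k-1 folds form the training data for this round.
        training = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield training, validation

n, k = 10, 5
for train_idx, val_idx in k_fold_splits(n, k):
    print(len(train_idx), len(val_idx))  # 8 2, five times
```

Each sample serves as validation data exactly once, so the k accuracy estimates can be averaged for a more stable comparison of the models.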

Figure 3: System Architecture


5.1. Performance metrics

There are a variety of measures for evaluating algorithms, and these measures were developed to assess very
different things, so clear criteria are needed for evaluating the proposed methods. False positive (FP),
false negative (FN), true positive (TP), true negative (TN), and the relations between them are the
quantities usually adopted by credit card fraud detection researchers to compare the accuracy of different
approaches. The definitions of these parameters are presented below:

 True Positive (TP): The true positive rate represents the portion of the fraudulent transactions
correctly being classified as fraudulent transactions.

$$TPR = \frac{TP}{TP + FN}$$
 True Negative (TN): The true negative rate represents the portion of the normal transactions
correctly being classified as normal transactions.

$$TNR = \frac{TN}{TN + FP}$$
 False Positive (FP): The false positive rate indicates the portion of the non-fraudulent
transactions wrongly being classified as fraudulent transactions.

$$FPR = \frac{FP}{FP + TN}$$
 False Negative (FN): The false negative rate indicates the portion of the fraudulent
transactions wrongly being classified as normal transactions.

$$FNR = \frac{FN}{FN + TP}$$
 Confusion matrix: The confusion matrix provides more insight into not only the performance of
a predictive model, but also which classes are being predicted correctly, which incorrectly, and
what types of errors are being made. The simplest confusion matrix is for a two-class classification
problem, with negative and positive classes. In this type of confusion matrix, each cell in the table
has a specific and well-understood name:

                     Predicted Positive    Predicted Negative
    Actual Positive          TP                    FN
    Actual Negative          FP                    TN

 Accuracy: Accuracy is the percentage of correctly classified instances. It is one of the most
widely used classification performance metrics for binary classification models. Accuracy can be
defined as:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

 Precision: Precision is the fraction of instances classified as positive (fraudulent) that
actually are positive.

$$Precision = \frac{TP}{TP + FP}$$
 Recall: Recall is a metric that quantifies the number of correct positive predictions made out of
all positive predictions that could have been made. Unlike precision that only comments on the
correct positive predictions out of all positive predictions, recall provides an indication of missed
positive predictions. Recall is calculated as the number of true positives divided by the total
number of true positives and false negatives.

$$Recall = \frac{TP}{TP + FN}$$
 F1-score: The F1 score is the weighted (harmonic) average of precision and recall. Therefore,
this score takes both false positives and false negatives into account.

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
 Support: The support is the number of samples of the true response that lie in that class. Support
is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the
training data may indicate structural weaknesses in the reported scores of the classifier and could
indicate the need for stratified sampling or rebalancing. Support doesn't change between models
but instead diagnoses the evaluation process.
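The metrics above can be computed directly from the four confusion-matrix counts; the counts used here are purely hypothetical, for illustration:

```python
# Derive accuracy, precision, recall, and F1 from confusion-matrix counts.
def metrics(tp, fn, fp, tn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 100 actual frauds, 900 actual genuine transactions.
acc, prec, rec, f1 = metrics(tp=80, fn=20, fp=10, tn=890)
print(acc)   # 0.97
print(rec)   # 0.8
```

Note how accuracy (0.97) looks excellent even though one fraud in five is missed (recall 0.8) — this is exactly why precision, recall, and F1 matter on imbalanced fraud data.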
5.2. Results and Discussion
5.2.1. Importing Dataset

The data we are going to use is the Kaggle Credit Card Fraud Detection dataset. It contains features V1
to V28, which are the principal components obtained by PCA. We are going to neglect the Time feature,
which is of no use in building the models. The remaining features are the 'Amount' feature, which contains
the total amount of money being transacted, and the 'Class' feature, which records whether the transaction
is a fraud case or not.


Now let's import the data using the 'read_csv' method and print the data to have a look at it in Python.
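A minimal sketch of this import step follows. It uses an in-memory stand-in for the Kaggle creditcard.csv file, since only the column layout (Time, V1..V28, Amount, Class) is assumed here, with just two V columns for brevity:

```python
import io
import pandas as pd

# Tiny stand-in for creditcard.csv; the real file has V1..V28 and ~285k rows.
csv_text = """Time,V1,V2,Amount,Class
0,-1.36,0.07,149.62,0
1,1.19,0.27,2.69,0
2,-1.36,-1.34,378.66,1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop the Time feature, which the models will not use.
df = df.drop(columns=["Time"])
print(df.shape)           # (3, 4)
print(df["Class"].sum())  # 1 fraudulent row in this toy sample
```

Against the real file, the same two lines (`pd.read_csv("creditcard.csv")` followed by the `drop`) produce the feature frame used in the rest of this section.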

5.2.2. Exploratory Data Analysis and Visualization

This section will explore the data and determine the best approaches to preprocessing, which will be
performed in the following section.

Features: The dataset has 28 anonymized features, each of the float64 data type. The dataset
also contains data on the transaction amount, time, and class.


Anonymized Features: The anonymized features in the original dataset used in this project have undergone
dimensionality reduction with PCA. This was likely done both to anonymize the data and to make the data
easier to work with.

Time: The time is represented as the number of seconds since the time of the first transaction in the
dataset.

Class variable: The Class variable labels transactions as either fraudulent (1) or non-fraudulent (0). This
will serve as our target. We will use the features in the dataset to predict the Class.

Transaction Amount and Time

The amount and time variables are the only two features that have not been adjusted or scaled in the
original dataset. Let us now explore these features.

Both of these features have interesting distributions.

The Amount column is heavily right-skewed, and we can see that although the average transaction
amount is $88, the maximum value is over $25,000! There are many outliers in this feature column, and
we will consider scaling this feature in later steps.


We also see that Time has an interesting distribution, but no outliers. Again, we will focus on scaling
the features after we have dealt with the skewed target variable by oversampling/undersampling the
dataset.

Anonymized Features exploration

Lastly, let's look at our 28 anonymized features. Although we cannot know exactly what these features
mean due to the prior dimensionality reduction, we can see how they are distributed.

As we can see, each of the features has a different distribution, and many have outliers. They all appear to
be centered on 0, but the range of values differs across the features.

5.2.3. Data Preprocessing

Now that we have visualized our data, we will preprocess the data prior to developing our machine
learning algorithms.


Missing Values

There are no missing values in our dataset, so there is no need for further action on this front.
Feature Scaling
As we have seen, many of our features are skewed, and they have very diverse ranges of values. To
correct for this, we will scale our data.
While the anonymized features have already been reduced through PCA and thus undergone initial
scaling prior to the PCA, we will scale them in this step. The reason for doing so is to bring all of our
features to similar scales, even if they have been scaled before. Our chief concern in this regard is to
ensure that our models are able to make predictions based on the information carried by our features,
rather than by their ranges.
X_train shape: (227845, 30)
y_train shape: (227845,)
X_test shape: (56962, 30)
y_test shape: (56962,)
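The scaling step just described can be sketched as a simple z-score standardization (synthetic data below; the actual notebook may equally well use scikit-learn's StandardScaler):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic columns mimicking the problem: a heavily skewed "Amount"-like
# feature and a roughly normal "V"-like feature on a different scale.
X = np.column_stack([
    rng.exponential(88.0, size=1000),
    rng.normal(0.0, 2.5, size=1000),
])

# Standardize each column: (x - mean) / std, so all features share one scale.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True: means ~0
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True: stds ~1
```

After scaling, models compare features by the information they carry rather than by their raw ranges, which is the chief concern stated above.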
Data Imbalance

As the target variable is heavily skewed, predictive models may be prone to assuming that test cases are
genuine. However, we want our models to accurately determine whether new transactions are fraudulent given
the input features. If we do not correct the data imbalance, our models will tend to overfit the genuine
transaction data and are likely to be insensitive to fraudulent transactions.
In this research work we consider an undersampling approach to correcting the imbalance.
Undersampling involves sampling a random subset of the data such that genuine and fraudulent
transactions are equally represented. The main issue with this approach is that in our case we would have
only ~800 data points in total, since there are only about 400 fraudulent transactions. This wastes much
of the data and results in models that are significantly less powerful than those given ample training data.
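The undersampling step can be sketched as follows (labels only, with hypothetical counts chosen to match the ~400 fraud / ~800 total figures above):

```python
import random

random.seed(42)

# Toy imbalanced dataset: 9,600 genuine and 400 fraudulent records.
data = [("legit", i) for i in range(9600)] + [("fraud", i) for i in range(400)]

fraud = [d for d in data if d[0] == "fraud"]
legit = [d for d in data if d[0] == "legit"]

# Undersample the majority class down to the size of the minority class.
balanced = fraud + random.sample(legit, len(fraud))
random.shuffle(balanced)

print(len(balanced))                                    # 800
print(sum(1 for lbl, _ in balanced if lbl == "fraud"))  # 400
```

The resulting set is perfectly balanced, but 9,200 genuine records are discarded — the data-waste trade-off noted above.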

Figure: Distribution of target variable after undersampling



5.2.4. Model evaluation results:


5.2.5. Model Test Results and discussion

Decision Tree Model Test Results

Confusion matrix plot of decision tree model test results


Logistic Regression Model Test Results

Confusion matrix plot of logistic regression model test results


Random Forest Model Test Results

Confusion matrix plot of Random Forest model test results

ROC curve of test models


6. Conclusion

In this paper, we studied applications of machine learning models such as decision tree, logistic
regression, and random forest, boosting data quality by applying data preprocessing steps, and showed that
this approach proves accurate in detecting fraudulent transactions while minimizing the number of false
alerts. The supervised learning algorithms are novel in this literature in terms of application domain. If
these algorithms are applied in a bank's credit card fraud detection system, the probability of a
transaction being fraudulent can be predicted soon after the transaction occurs, and a series of anti-fraud
strategies can be adopted to protect banks from great losses and reduce risks. The objective of the study
was framed differently from typical classification problems in that we had a variable misclassification
cost. Precision, recall, F1-score, support, and accuracy are used to evaluate the performance of the
proposed system. By comparing all three methods, we found that the random forest classifier with a boosting
technique performs better than the logistic regression and decision tree methods.


