
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 13, Number 24 (2018) pp. 16819-16824
© Research India Publications. http://www.ripublication.com

Machine Learning For Credit Card Fraud Detection System

Lakshmi S V S S1, Selvani Deepthi Kavila2

1,2 Department of CSE, Anil Neerukonda Institute Of Technology And Sciences (A), Visakhapatnam-531162, India

Abstract

The rapid growth of the e-commerce industry has led to an exponential increase in the use of credit cards for online purchases, and consequently there has been a surge in the fraud related to them. In recent years it has become very difficult for banks to detect fraud in the credit card system. Machine learning plays a vital role in detecting credit card fraud in transactions. To predict these transactions, banks employ various machine learning methodologies: past data is collected and new features are used to enhance predictive power. The performance of fraud detection in credit card transactions is greatly affected by the sampling approach on the data set, the selection of variables and the detection techniques used. This paper investigates the performance of logistic regression, decision tree and random forest for credit card fraud detection. The dataset of credit card transactions is collected from Kaggle and contains a total of 2,84,808 credit card transactions from a European bank. It considers fraud transactions as the "positive class" and genuine ones as the "negative class". The data set is highly imbalanced: about 0.172% of the transactions are fraudulent and the rest are genuine. The authors applied oversampling to balance the data set, which resulted in 60% fraud transactions and 40% genuine ones. The three techniques are applied to the dataset and the work is implemented in the R language. The performance of the techniques is evaluated for different variables based on sensitivity, specificity, accuracy and error rate. The resulting accuracies for the logistic regression, decision tree and random forest classifiers are 90.0, 94.3 and 95.5 respectively. The comparative results show that the random forest performs better than the logistic regression and decision tree techniques.

Keywords: Fraud detection, Credit card, Logistic regression, Decision tree, Random forest.


1. INTRODUCTION

Credit card fraud is a wide-ranging term for theft and fraud committed using, or involving, a payment card. The purpose may be to purchase goods without paying, or to transfer unauthorized funds from an account. Credit card fraud is also an adjunct to identity theft. According to the United States Federal Trade Commission, the rate of identity theft had been holding stable during the mid-2000s, but it increased by 21 percent in 2008. Even so, credit card fraud, the crime most people associate with identity theft, decreased as a percentage of all identity-theft complaints. In 2000, out of 13 billion transactions made annually, approximately 10 million, or one out of every 1,300 transactions, turned out to be fraudulent. Also, 0.05% (5 out of every 10,000) of all monthly active accounts were fraudulent. Today, fraud detection systems are introduced to control one-twelfth of one percent of all transactions processed, which still translates into billions of dollars in losses. Credit card fraud is one of the biggest threats to business establishments today. However, to combat the fraud effectively, it is important to first understand the mechanisms of executing a fraud. Credit card fraudsters employ a large number of ways to commit fraud. In simple terms, credit card fraud is defined as "when an individual uses another individual's credit card for personal reasons while the owner of the card and the card issuer are not aware of the fact that the card is being used". Card fraud begins either with the theft of the physical card or with important data associated with the account, including the card account number or other information that must necessarily be available to a merchant during a permissible transaction. Card numbers, generally the Primary Account Number (PAN), are often printed on the card, and a magnetic stripe on the back contains the data in machine-readable format. It contains the following fields:

 Name of card holder
 Card number
 Expiration date
 Verification/CVV code
 Type of card

There are more methods to commit credit card fraud. Fraudsters are talented and fast-moving people. Among traditional approaches, one fraud type identified by this paper is application fraud, where a person gives wrong information about himself to get a credit card. There is also the unauthorized use of lost and stolen cards, which makes up a significant area of credit card fraud. There are more sophisticated credit card fraudsters, starting with those who produce fake and doctored cards; there are also those who use skimming to commit fraud: the information held on either the magnetic strip on the back of the credit card, or the data stored on the smart chip, is copied from one card to another. Site cloning and false merchant sites on the Internet are becoming a popular method of fraud for many criminals with a skilled ability for hacking. Such sites are developed to get people to hand over their credit card details without knowing they have been swindled.

The rest of the paper is organized as follows: section 2 describes the related work on the credit card system, section 3 describes the proposed system architecture and methodology, section 4 shows the performance analysis and results, and section 5 presents the conclusion.


2. RELATED WORK

A. Shen et al. (2007) demonstrated the efficiency of classification models on the credit card fraud detection problem, proposing three classification models, i.e., decision tree, neural network and logistic regression. Among the three models, the neural network and logistic regression outperform the decision tree. M. J. Islam et al. (2007) proposed a probability-theory framework for making decisions under uncertainty. After reviewing Bayesian theory, a naïve Bayes classifier and a k-nearest neighbor classifier were implemented and applied to the credit card dataset. Y. Sahin and E. Duman (2011) studied credit card fraud detection in which seven classification methods took a major role; in this work they included decision trees and SVMs to decrease the risk for the banks, and suggested that Artificial Neural Network and Logistic Regression classification models are more helpful in improving fraud-detection performance. Y. Sahin and E. Duman (2011) also used Artificial Neural Network and Logistic Regression classification and explained that ANN classifiers outperform LR classifiers in solving the problem under investigation. As the distribution of the training data sets became more biased, the efficiency of all models decreased in catching the fraudulent transactions.


3. PROPOSED TECHNIQUE:

The proposed techniques are used in this paper for detecting frauds in the credit card system. Comparisons are made between different machine learning algorithms, namely Logistic Regression, Decision Tree and Random Forest, to determine which algorithm suits best and can be adopted by credit card merchants for identifying fraud transactions. Figure 1 shows the architectural diagram representing the overall system framework.

The processing steps to detect the best algorithm for the given dataset are discussed in Table 1.

Table 1: Processing steps

Algorithm steps:
Step 1: Read the dataset.
Step 2: Random sampling is done on the data set to make it balanced.
Step 3: Divide the dataset into two parts, i.e., a train dataset and a test dataset.
Step 4: Feature selection is applied for the proposed models.
Step 5: Accuracy and performance metrics are calculated to measure the efficiency of the different algorithms.
Step 6: Retrieve the best algorithm for the given dataset based on efficiency.

Figure 1: System Architecture
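The steps above can be sketched end to end. The outline below is an illustrative Python sketch only (the paper's implementation is in R), with a stand-in majority-class "model" in place of the actual classifiers:

```python
import random

def train_test_split(rows, labels, train_frac=0.7, seed=0):
    """Step 3: shuffle and divide the data into train and test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])

def majority_baseline(train_labels):
    """Stand-in 'model': always predicts the most frequent training label."""
    return max(set(train_labels), key=train_labels.count)

# Toy dataset: 10 rows, binary labels
rows = [[i] for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
Xtr, ytr, Xte, yte = train_test_split(rows, labels)   # 70% train, 30% test
pred = majority_baseline(ytr)
acc = sum(1 for y in yte if y == pred) / len(yte)     # Step 5: accuracy
print(len(Xtr), len(Xte), acc)
```

In the paper, the stand-in baseline is replaced in turn by logistic regression, decision tree and random forest, and the algorithm with the best test accuracy is retained.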


3.1 Logistic Regression:

Logistic regression is a classification algorithm used to predict binary values for a given set of independent variables (1/0, Yes/No, True/False). Dummy variables are used to represent binary/categorical values. Logistic regression can be seen as a special case of linear regression: when the resulting variable is categorical, the log of odds is used as the dependent variable, and the model predicts the probability of occurrence of an event by fitting data to a logistic function:

O = e^(I0 + I1*x) / (1 + e^(I0 + I1*x)) (3.1)

Where
O is the predicted output
I0 is the bias or intercept term
I1 is the coefficient for the single input value (x).

Each column in the input data has an associated coefficient (a constant real value) that must be learned from the training data:

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) (3.2)

Logistic regression starts from the simple linear regression equation, with the dependent variable enclosed in a link function:

A(O) = β0 + β(x) (3.3)

Where
A(): link function
O: outcome variable
x: dependent variable

The function is established using two things: 1) the probability of success (pr) and 2) the probability of failure (1−pr).

pr should meet the following criteria: a) the probability must always be positive (pr >= 0); b) the probability must always be less than or equal to 1 (pr <= 1). Applying the exponential in the first criterion guarantees a positive value:

pr = exp(β0 + β(x)) = e^(β0 + β(x)) (3.4)

For the second criterion, the same exponential is divided by itself plus 1, so that the value is less than or equal to 1:

pr = e^(β0 + β(x)) / (e^(β0 + β(x)) + 1) (3.5)

The logistic function is used in logistic regression with a cost function that quantifies the error, as the modeled response is compared with the true value:

J(θ) = −(1/m) ∑ [yi log(hθ(xi)) + (1−yi) log(1−hθ(xi))] (3.6)

Where
hθ(xi): logistic function
yi: outcome variable
Gradient descent is the learning algorithm used to minimize this cost.


3.2 Decision Tree Algorithm:

A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator in the input variables.

TYPES OF DECISION TREE

1. Categorical Variable Decision Tree: a decision tree with a categorical target variable is called a categorical variable decision tree.
2. Continuous Variable Decision Tree: a decision tree with a continuous target variable is called a continuous variable decision tree.

TERMINOLOGY OF DECISION TREE:

1. Root Node: it represents the entire population or sample, which further gets divided into two or more homogeneous sets.
2. Splitting: the process of dividing a node into two or more sub-nodes.
3. Decision Node: when a sub-node splits into further sub-nodes, it is called a decision node.
4. Leaf/Terminal Node: nodes that do not split are called leaf or terminal nodes.
5. Pruning: removing sub-nodes of a decision node is called pruning; it is the opposite of splitting.
6. Branch/Sub-Tree: a sub-section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: a node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are its children.
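As a concrete illustration of the logistic model of Section 3.1, Eqs. (3.1) and (3.6) can be evaluated numerically. This is a toy sketch with made-up coefficients, not the paper's fitted R model:

```python
import math

def logistic(x, i0, i1):
    """Eq. (3.1): O = e^(I0 + I1*x) / (1 + e^(I0 + I1*x))."""
    z = i0 + i1 * x
    return math.exp(z) / (1 + math.exp(z))

def cost(xs, ys, i0, i1):
    """Eq. (3.6): J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = logistic(x, i0, i1)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / m

# Toy data: outcome flips from 0 to 1 as x grows
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]
print(logistic(0.0, 0.0, 1.0))   # 0.5 at the decision boundary
print(cost(xs, ys, -1.5, 1.0))   # finite, positive cross-entropy
```

Gradient descent would repeatedly adjust i0 and i1 in the direction that lowers this cost.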
WORKING OF DECISION TREE
pr = e^(βo + β(x)) / e^(βo + β(x)) + 1 (3.5)
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, the purity of the node increases with respect to the target variable. The decision tree splits the nodes on


all available variables and then selects the split which results in the most homogeneous sub-nodes, using criteria such as:

1. Gini Index
2. Information Gain
3. Chi Square
4. Reduction of Variance

Table 2: Algorithm steps for finding the best algorithm

Step 1: Import the dataset.
Step 2: Convert the data into data-frame format.
Step 3: Do random oversampling using the ROSE package.
Step 4: Decide the amount of data for training and testing.
Step 5: Give 70% of the data for training and the remaining data for testing.
Step 6: Assign the train dataset to the models.
Step 7: Choose an algorithm among the 3 different algorithms and create the model.
Step 8: Make predictions on the test dataset for each algorithm.
Step 9: Calculate the accuracy for each algorithm.
Step 10: Apply the confusion matrix for each variable.
Step 11: Compare the algorithms over all the variables and find the best algorithm.
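Of the split-selection criteria listed above, the Gini index is the easiest to sketch. The helper below is illustrative (not from the paper): a split is preferred when its weighted Gini impurity is lower, i.e., its sub-nodes are more homogeneous.

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gini(left, right):
    """Weighted Gini impurity of a binary split; lower means purer sub-nodes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

print(gini([0, 0, 1, 1]))          # 0.5: maximally mixed two-class node
print(split_gini([0, 0], [1, 1]))  # 0.0: a perfectly pure split
```

A tree builder evaluates `split_gini` for every candidate variable and threshold, and keeps the split with the lowest value.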

3.3 Random Forest:

Random forest is a tree-based algorithm which involves building several trees and combining their outputs to improve the generalization ability of the model. This method of combining trees is known as an ensemble method. Ensembling is nothing but a combination of weak learners (individual trees) to produce a strong learner. Random forest can be used to solve regression and classification problems: in regression problems the dependent variable is continuous, while in classification problems the dependent variable is categorical.

WORKING OF RANDOM FOREST:

The bagging algorithm is used to create random samples. Given a data set D1 with n rows and m columns, a new data set D2 is created by sampling n cases at random with replacement from the original data. About 1/3rd of the rows of D1 are left out; these are known as the Out-of-Bag samples. The new dataset D2 is used to train the models, and the Out-of-Bag samples are used to determine an unbiased estimate of the error. Out of the m columns, M << m columns are selected at random at each node in the data set. The usual default choice of M is m/3 for a regression tree and sqrt(m) for a classification tree. Unlike a single tree, no pruning takes place in a random forest, i.e., each tree is grown fully. In decision trees, pruning is a method to avoid overfitting: it means selecting a sub-tree that leads to the lowest test error rate, where cross-validation is used to estimate the test error rate of a sub-tree. In a random forest, several trees are grown and the final prediction is obtained by averaging or voting.


4. PERFORMANCE METRICS AND EXPERIMENTAL RESULTS:

4.1 Performance metrics:

The basic performance measures are derived from the confusion matrix. The confusion matrix is a 2-by-2 table containing the four outcomes produced by a binary classifier. Measures such as sensitivity, specificity, accuracy and error rate are derived from the confusion matrix.

Accuracy:
Accuracy is calculated as the total number of correct predictions (A+B) divided by the total size of the dataset (C+D). It can also be calculated as (1 − error rate).

Accuracy = (A+B)/(C+D) (4.1)

Where
A = True Positives
B = True Negatives
C = Positives
D = Negatives

Error rate:
Error rate is calculated as the total number of incorrect predictions (E+F) divided by the total size of the dataset (C+D).

Error rate = (E+F)/(C+D) (4.2)

Where
E = False Positives
F = False Negatives
C = Positives
D = Negatives
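The measures in Eqs. (4.1) and (4.2) follow directly from the confusion-matrix counts. A small sketch, using the text's notation (A/B = true positives/negatives, E/F = false positives/negatives, C+D = total) on made-up predictions:

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP (A), TN (B), FP (E) and FN (F) for a binary classifier."""
    A = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    B = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    E = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    F = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return A, B, E, F

# Toy labels: fraud = 1 (positive class), genuine = 0
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
A, B, E, F = confusion_counts(actual, predicted)
total = A + B + E + F              # C + D in the text's notation
accuracy = (A + B) / total         # Eq. (4.1)
error_rate = (E + F) / total       # Eq. (4.2)
print(accuracy, error_rate)        # 0.75 0.25
```

Note that accuracy and error rate always sum to 1, matching the text's remark that accuracy is (1 − error rate).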


Sensitivity:
Sensitivity is calculated as the number of correct positive predictions (A) divided by the total number of positives (C).

Sensitivity = A/C (4.3)

Specificity:
Specificity is calculated as the number of correct negative predictions (B) divided by the total number of negatives (D).

Specificity = B/D (4.4)

Accuracy, error rate, sensitivity and specificity are used to report the performance of the system in detecting credit card fraud.

In this paper, three machine learning algorithms are developed to detect fraud in the credit card system. To evaluate the algorithms, 70% of the dataset is used for training and 30% is used for testing and validation. Accuracy, error rate, sensitivity and specificity are evaluated for different numbers of variables for the three algorithms, as shown in Table 3. The accuracies for the logistic regression, decision tree and random forest classifiers are 90.0, 94.3 and 95.5 respectively. The comparative results show that the random forest performs better than the logistic regression and decision tree techniques.

Table 3: Performance analysis for three different algorithms

Feature Selection     Logistic regression    Decision tree    Random Forest
For 5 variables       87.2                   89.0             90.1
For 10 variables      88.6                   92.1             93.6
For all variables     90.0                   94.3             95.5


5. CONCLUSION

In this paper, machine learning techniques like logistic regression, decision tree and random forest were used to detect fraud in the credit card system. Sensitivity, specificity, accuracy and error rate were used to evaluate the performance of the proposed system. The accuracies for the logistic regression, decision tree and random forest classifiers are 90.0, 94.3 and 95.5 respectively. Comparing all three methods, we found that the random forest classifier is better than logistic regression and decision tree.


REFERENCES

[1] Andrew Y. Ng, Michael I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes", Advances in Neural Information Processing Systems, vol. 2, pp. 841-848, 2002.
[2] A. Shen, R. Tong, Y. Deng, "Application of classification models on credit card fraud detection", Service Systems and Service Management 2007 International Conference, pp. 1-4, 2007.
[3] A. C. Bahnsen, A. Stojanovic, D. Aouada, B. Ottersten, "Cost sensitive credit card fraud detection using Bayes minimum risk", Machine Learning and Applications (ICMLA), 2013 12th International Conference, vol. 1, pp. 333-338, 2013.
[4] B. Meena, I. S. L. Sarwani, S. V. S. S. Lakshmi, "Web Service mining and its techniques in Web Mining", IJAEGT, vol. 2, issue 1, pp. 385-389.
[5] F. N. Ogwueleka, "Data Mining Application in Credit Card Fraud Detection System", Journal of Engineering Science and Technology, vol. 6, no. 3, pp. 311-322, 2011.
[6] G. Singh, R. Gupta, A. Rastogi, M. D. S. Chandel, A. Riyaz, "A Machine Learning Approach for Detection of Fraud based on SVM", International Journal of Scientific Engineering and Technology, vol. 1, no. 3, pp. 194-198, 2012, ISSN: 2277-1581.
[7] K. Chaudhary, B. Mallick, "Credit Card Fraud: The study of its impact and detection techniques", International Journal of Computer Science and Network (IJCSN), vol. 1, no. 4, pp. 31-35, 2012, ISSN: 2277-5420.
[8] M. J. Islam, Q. M. J. Wu, M. Ahmadi, M. A. Sid-Ahmed, "Investigating the Performance of Naive-Bayes Classifiers and K-Nearest Neighbor Classifiers", IEEE International Conference on Convergence Information Technology, pp. 1541-1546, 2007.
[9] R. Wheeler, S. Aitken, "Multiple algorithms for fraud detection", Knowledge-Based Systems, Elsevier, vol. 13, no. 2, pp. 93-99, 2000.
[10] S. Patil, H. Somavanshi, J. Gaikwad, A. Deshmane, R. Badgujar, "Credit Card Fraud Detection Using Decision Tree Induction Algorithm", International Journal of Computer Science and Mobile Computing (IJCSMC), vol. 4, no. 4, pp. 92-95, 2015, ISSN: 2320-088X.
[11] S. Maes, K. Tuyls, B. Vanschoenwinkel, B. Manderick, "Credit card fraud detection using Bayesian and neural networks", Proceedings of the 1st International NAISO Congress on Neuro Fuzzy Technologies, pp. 261-270, 2002.
[12] S. Bhattacharyya, S. Jha, K. Tharakunnel, J. C. Westland, "Data mining for credit card fraud: A comparative study", Decision Support Systems, vol. 50, no. 3, pp. 602-613, 2011.
[13] Y. Sahin, E. Duman, "Detecting credit card fraud by ANN and logistic regression", Innovations in Intelligent Systems and Applications (INISTA) 2011 International Symposium, pp. 315-319, 2011.
[14] Selvani Deepthi Kavila, Lakshmi S. V. S. S., Rajesh B., "Automated Essay Scoring using Feature Extraction Method", IJCER, vol. 7, issue 4(L), pp. 12161-12165.
[15] S. V. S. S. Lakshmi, K. S. Deepthi, Ch. Suresh, "Text Summarization basing on Font and Cue-phrase


Feature for a Single Document", Emerging ICT for Bridging the Future − Volume 2, Advances in Intelligent Systems and Computing, pp. 537-542.
[16] Y. Sahin, S. Bulkan, E. Duman, "A cost-sensitive
decision tree approach for fraud detection", Expert
Systems with Applications, vol. 40, no. 15, pp. 5916-
5923, 2013.
[17] Y. Kou, C-T. Lu, S. Sinvongwattana, Y-P. Huang,
"Survey of Fraud Detection Techniques", Proceedings
of the 2004 IEEE International Conference on
Networking Sensing & Control, 2004.
[18] Y. Sahin, E. Duman, "Detecting Credit Card Fraud by
Decision Trees and Support Vector Machines",
Proceedings of International Multi-Conference of
Engineers and Computer Scientists (IMECS 2011), vol.
1, pp. 1-6, Mar. 16-18 2011, ISSN 2078-0966, ISBN
978-988-18210-3-4.

