FINAL
PRESENTATION
CREDIT CARD FRAUD DETECTION
USING RANDOM FOREST & CART
ALGORITHM
CONTENTS
ABSTRACT
INTRODUCTION
SYSTEM ANALYSIS
SYSTEM ARCHITECTURE
SYSTEM MODULES
ALGORITHM UTILIZED
SYSTEM REQUIREMENT
CONCLUSION
REFERENCES
ABSTRACT
Credit card fraud is a major problem in financial services and causes a large number of
problems every year. There is also a lack of research that analyzes real-world credit card fraud
data. This paper uses a data mining approach based on machine learning algorithms to detect
credit card fraud. Standard models are applied first. Hybrid methods, namely AdaBoost and
majority voting, are applied second. The efficacy of the models is evaluated on a publicly available
credit card data set. A real-world data set from a financial institution is then analyzed. In addition,
noise is added to the data samples to assess the robustness of the algorithms. The experimental
results indicate that the hybrid majority voting method achieves good accuracy rates in credit
card fraud detection.
INTRODUCTION
•Risk assessment is widely used at banks around the world. Since credit risk evaluation is
very crucial, a variety of techniques are used for risk level assessment.
•Banks classify clients according to their profiles. During classification, the financial background
of customers and subjective factors about them are evaluated.
EXISTING SYSTEM
A previously proposed credit card fraud detection system consisted of a
rule-based filter, Dempster–Shafer adder, transaction history database,
and Bayesian learner. The Dempster–Shafer theory combined various
pieces of evidential information and created an initial belief, which was
used to classify a transaction as normal, suspicious, or abnormal. If a
transaction was suspicious, the belief was further evaluated using
transaction history through Bayesian learning.
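The evidence-combination step can be illustrated with Dempster's rule of combination. The sketch below is only an illustration, not the cited system's implementation: the two evidence sources (rule_a, rule_b), their mass values, and the 0.3 / 0.7 decision thresholds are all hypothetical.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions keyed by frozenset."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("evidence sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

def classify(belief_in_fraud, lower=0.3, upper=0.7):
    """Map the belief in fraud to a label (thresholds are assumed values)."""
    if belief_in_fraud < lower:
        return "normal"
    if belief_in_fraud > upper:
        return "abnormal"
    return "suspicious"

FRAUD = frozenset({"fraud"})
LEGIT = frozenset({"legit"})
EITHER = FRAUD | LEGIT  # mass left uncommitted

# Hypothetical evidence from two rule-based filters.
rule_a = {FRAUD: 0.6, EITHER: 0.4}
rule_b = {FRAUD: 0.5, LEGIT: 0.2, EITHER: 0.3}

belief = combine(rule_a, rule_b)
decision = classify(belief.get(FRAUD, 0.0))
```

In a system like the one described, a "suspicious" result would be passed on to the Bayesian learner along with the transaction history.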
DISADVANTAGES:
•In this paper, a new comparative measure that reasonably
represents the gains and losses due to fraud detection is proposed.
•A cost-sensitive method based on Bayes minimum risk is
presented using the proposed cost measure.
PROPOSED SYSTEM:
In the proposed system, we apply the random forest algorithm to classify
the credit card dataset.
Random forest is an algorithm for classification and regression.
In short, it is a collection of decision tree classifiers.
Random forest has an advantage over a single decision tree because it corrects
the tree's habit of overfitting to its training set.
The random forest algorithm has been found to provide a good estimate of the
generalization error and to be resistant to overfitting.
ADVANTAGES
•Random forest naturally ranks the importance of the input variables in a
regression or classification problem.
•The 'amount' feature is the transaction amount. The 'class' feature is the
target class for binary classification; it takes the value 1 for the positive
case (fraud) and 0 for the negative case (not fraud).
SYSTEM ARCHITECTURE:
First, the credit card dataset is taken from the source, and
cleaning and validation are performed on it. This includes
removing redundancy, filling empty spaces in columns, and
converting the necessary variables into factors or classes.
The data is then divided into two parts: a training dataset
and a test dataset. The original sample is randomly
partitioned into the test and train datasets.
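The pipeline above can be sketched in plain Python. This is a minimal sketch under assumptions: the rows are hypothetical (amount, class) pairs, missing values are filled with the column mean (the slides do not say which fill strategy is used), and the 25% test fraction is an arbitrary choice.

```python
import random

def clean_rows(rows):
    """Drop exact duplicate rows, then fill missing values (None) with the column mean."""
    # dict.fromkeys keeps the first occurrence of each row and preserves order.
    unique_rows = list(dict.fromkeys(tuple(r) for r in rows))
    n_cols = len(unique_rows[0])
    means = []
    for col in range(n_cols):
        present = [r[col] for r in unique_rows if r[col] is not None]
        means.append(sum(present) / len(present))
    return [tuple(means[c] if v is None else v for c, v in enumerate(r))
            for r in unique_rows]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly partition rows into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (amount, class) rows; class 1 = fraud, 0 = not fraud.
raw = [
    (120.0, 0), (120.0, 0),  # redundant duplicate
    (None, 0),               # empty amount
    (75.5, 0), (980.0, 1), (15.0, 0), (640.0, 1), (33.0, 0), (8.0, 0),
]

cleaned = clean_rows(raw)
train, test = train_test_split(cleaned, test_fraction=0.25, seed=1)
```

Fixing the seed makes the random partition reproducible, which helps when comparing models on the same split.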
SYSTEM MODULES:
MODULE 1:
o DATA COLLECTION:
This step is concerned with selecting the subset of all available data that you will be working with.
MODULE 2:
o DATA PRE-PROCESSING:
Organize your selected data by formatting, cleaning and sampling from it.
Three common data pre-processing steps are:
Formatting: The data you have selected may not be in a format that is suitable for you to work with.
Cleaning: Cleaning data is the removal or fixing of missing data. There may be data instances that are
incomplete and do not carry the data you believe you need to address the problem.
Sampling: There may be far more selected data available than you need to work with. More data can result
in much longer running times for algorithms and larger computational and memory requirements.
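The three pre-processing steps can be sketched as follows. This is an illustrative sketch only: the "amount,class" text format, the function names, and the limit of 3 sampled records are assumptions made for the example.

```python
import random

def format_record(raw):
    """Formatting: turn a raw text record 'amount,class' into typed values."""
    amount_text, class_text = raw.split(",")
    amount = float(amount_text) if amount_text.strip() else None
    label = int(class_text) if class_text.strip() else None
    return amount, label

def is_complete(record):
    """Cleaning: a record is usable only if no field is missing."""
    return all(value is not None for value in record)

def sample_records(records, max_records, seed=0):
    """Sampling: keep at most max_records, drawn at random without replacement."""
    if len(records) <= max_records:
        return list(records)
    return random.Random(seed).sample(records, max_records)

# Hypothetical raw records; two of them are incomplete.
raw_lines = ["12.50,0", "980.00,1", ",0", "43.10,0", "7.99,", "310.00,1", "65.00,0"]

formatted = [format_record(line) for line in raw_lines]
complete = [r for r in formatted if is_complete(r)]
subset = sample_records(complete, max_records=3, seed=7)
```

Here incomplete records are removed; fixing them (for example, by filling in a default value) is the other option the cleaning step mentions.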
MODULE 3:
o FEATURE EXTRACTION:
The next step is feature extraction, which is an attribute
reduction process. Feature extraction algorithms are very
popular in text classification tasks.
MODULE 4:
o MODEL EVALUATION:
Model evaluation is an integral part of the model development
process. It helps to find the best model that represents our data
and to estimate how well the chosen model will work in the
future. There are two methods of evaluating models in data
science: hold-out and cross-validation.
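The two evaluation methods can be compared in a short sketch. The data and the threshold "model" below are hypothetical stand-ins chosen to keep the example self-contained; in the project itself the model would be the random forest.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Hold-out: one random split into train and test indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_splits(n_samples, k=5, seed=0):
    """Cross-validation: each of the k folds serves as the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_threshold(amounts, labels):
    """Stand-in model: midpoint between the mean amounts of the two classes."""
    fraud = [a for a, y in zip(amounts, labels) if y == 1]
    legit = [a for a, y in zip(amounts, labels) if y == 0]
    return (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2

# Hypothetical toy data: transaction amounts and class labels (1 = fraud).
amounts = [10.0, 12.0, 15.0, 20.0, 25.0, 900.0, 950.0, 980.0, 1000.0, 1200.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

def evaluate(train_idx, test_idx):
    threshold = fit_threshold([amounts[i] for i in train_idx],
                              [labels[i] for i in train_idx])
    predictions = [1 if amounts[i] > threshold else 0 for i in test_idx]
    return accuracy([labels[i] for i in test_idx], predictions)

holdout_score = evaluate(*holdout_split(len(amounts), test_fraction=0.2, seed=3))
cv_scores = [evaluate(tr, te) for tr, te in k_fold_splits(len(amounts), k=5, seed=3)]
```

Hold-out gives a single score from one split, while cross-validation gives one score per fold. The spread of the fold scores shows how much the result depends on the split.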
ALGORITHM UTILIZED:
WORKING OF RANDOM FOREST:
The following are the basic steps involved in performing the random forest algorithm:
1. Pick N random records from the dataset.
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
4. For a classification problem, each tree in the forest predicts the category to which the
new record belongs. Finally, the new record is assigned to the category that wins the
majority vote.
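The four steps above can be sketched in plain Python. This is a simplified illustration, not a full random forest: each "tree" is a one-split decision stump on a randomly chosen feature, records are drawn with replacement, and the toy data (amount, transactions in the last hour) is hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Step 1: pick N random records (with replacement) from the dataset.
    return [rng.choice(rows) for _ in rows]

def majority_label(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]

def build_stump(rows, rng):
    # Step 2 (simplified): a one-split "tree" on a randomly chosen feature.
    feature = rng.randrange(len(rows[0][0]))
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [r for r in rows if r[0][feature] <= threshold]
        right = [r for r in rows if r[0][feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority_label(left), majority_label(right)
        errors = (sum(y != left_label for _, y in left)
                  + sum(y != right_label for _, y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        label = majority_label(rows)
        return (feature, 0.0, label, label)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label

def build_forest(rows, n_trees, seed=0):
    rng = random.Random(seed)
    # Step 3: repeat steps 1 and 2 once per tree.
    return [build_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]

def predict_forest(forest, x):
    # Step 4: every tree votes; the category with the majority wins.
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical rows: ((amount, transactions in last hour), class); 1 = fraud.
toy_rows = [((10.0, 1.0), 0), ((12.0, 2.0), 0), ((15.0, 1.0), 0), ((20.0, 3.0), 0),
            ((900.0, 9.0), 1), ((950.0, 12.0), 1), ((980.0, 10.0), 1), ((1000.0, 15.0), 1)]

forest = build_forest(toy_rows, n_trees=25, seed=42)
prediction = predict_forest(forest, (990.0, 11.0))
```

A full implementation would grow deeper trees and consider a random subset of features at every split. The stump version keeps the sampling and voting logic visible.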
SYSTEM REQUIREMENTS:
The requirements specification is a technical specification of
the requirements for the software product. It is the first step in
the requirements analysis process. It lists the requirements of a
particular software system, including functional, performance and
security requirements. The purpose of the software requirements
specification is to provide a detailed overview of the software
project, its parameters and goals.
HARDWARE REQUIREMENTS:
Processor - Intel
RAM - 4 GB
Hard Disk - 260 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
SOFTWARE REQUIREMENTS:
Python
Anaconda
OS - Windows 7, 8 and 10 (32 and 64 bit)
CONCLUSION
The random forest algorithm will perform better with a
larger amount of training data, but speed during testing and
application will suffer. Applying more pre-processing
techniques would also help. The SVM algorithm still suffers
from the imbalanced dataset problem and requires more
pre-processing to give better results. The results shown by
SVM are good, but they could have been better if more
pre-processing had been done on the data.
REFERENCES:
[1] Sudhamathy G: Credit Risk Analysis and Prediction
Modelling of Bank Loans Using R, vol. 8, no. 5, pp. 1954-1966.
[2] LI Changjian, HU Peng: Credit Risk Assessment for Rural
Credit Cooperatives based on Improved Neural Network,
International Conference on Smart Grid and Electrical
Automation, vol. 60, no. 3, pp. 227-230, 2017.