Credit Card Fraud Detection

ABSTRACT
Domain Overview
1.2. OBJECTIVES
2. EXISTING SYSTEM
2.1. DISADVANTAGES
2.2.1 ADVANTAGES
General
3. FEASIBILITY STUDY
Data Wrangling
Data Collection
Preprocessing
4. PROJECT REQUIREMENTS
4.1. General
4.2. Environmental Requirements
4.3.2. CONDA
4.3.4 PYTHON
5. SYSTEM DIAGRAMS
6. LIST OF MODULES
MODULE DIAGRAMS
General Formula
F1-Score Formula
6.2.3. RANDOM FOREST CLASSIFIER
8. CONCLUSION
9. REFERENCES
ABSTRACT:
1. INTRODUCTION
Domain overview
The term "data science" has been traced back to 1974, when
Peter Naur proposed it as an alternative name for computer science. In
1996, the International Federation of Classification Societies became the
first conference to specifically feature data science as a topic. However,
the definition was still in flux.
The term "data science" was first coined in 2008 by D.J. Patil and Jeff Hammerbacher, the pioneering leads of the data and analytics efforts at LinkedIn and Facebook. In less than a decade, it has become one of the hottest and most trending professions in the market.
Data Scientist:
Data scientists examine which questions need answering and where
to find the related data. They have business acumen and analytical skills
as well as the ability to mine, clean, and present data. Businesses use
data scientists to source, manage, and analyze large amounts of
unstructured data.
speech recognition and machine vision.
highest level in strategic game systems (such as chess and Go). As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.
since antiquity. Science fiction and futurology have also suggested that,
with its enormous potential and power, AI may become an existential
risk to humanity.
As the hype around AI has accelerated, vendors have been
scrambling to promote how their products and services use AI. Often what
they refer to as AI is simply one component of AI, such as machine
learning. AI requires a foundation of specialized hardware and software for
writing and training machine learning algorithms. No one programming
language is synonymous with AI, but a few, including Python, R and Java,
are popular.
relevant fields are filled in properly, AI tools often complete jobs quickly
and with relatively few errors.
1.1.4 MACHINE LEARNING
speech recognition, handwriting recognition, bio metric identification,
document classification etc.
Supervised machine learning accounts for the majority of practical machine learning. In supervised learning you have input variables (X) and an output variable (y), and you use an algorithm to learn the mapping function from the input to the output, y = f(X). The goal is to approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (y) for that data. Techniques of supervised machine learning include logistic regression, multi-class classification, decision trees and support vector machines. Supervised learning requires that the data used to train the algorithm is already labeled with correct answers. Supervised learning problems can be further grouped into regression and classification problems. Both share the goal of constructing a succinct model that can predict the value of the dependent attribute from the attribute variables; the difference between the two tasks is that the dependent attribute is numerical for regression and categorical for classification. A classification model attempts to draw some conclusion from observed values: given one or more inputs, a classification model will try to predict the value of one or more outcomes. A classification problem is one where the output variable is a category, such as "red" or "blue".
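As a minimal sketch of this idea (using a generated toy dataset standing in for real labeled transactions, not the report's own data), a supervised classifier learns y = f(X) from labeled examples and then predicts categories for unseen inputs:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy labeled data standing in for real transactions (hypothetical)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # learn the mapping y = f(X)
print(model.predict(X_test[:5]))     # predict categories for new inputs
```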
1.2. OBJECTIVES
The goal is to develop a machine learning model for credit card fraud prediction, to potentially replace updatable supervised machine learning classification models, by predicting results in the form of the best accuracy obtained from comparing supervised algorithms.
1.2.1 PROJECT GOALS
Exploration data analysis of variable identification
Loading the given dataset
Importing the required library packages
Analyzing the general properties
Finding duplicate and missing values
Checking unique and count values
Uni-variate data analysis
Renaming, adding and dropping data
Specifying data types
Exploration data analysis of bi-variate and multi-variate
Plotting diagrams: pairplot, heatmap, bar chart and histogram
Outlier detection with feature engineering
Pre-processing the given dataset
Splitting the test and training datasets
Comparing the Decision Tree, Logistic Regression and Random Forest models, etc.
Comparing algorithms to predict the result based on the best accuracy
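A minimal sketch of the first few inspection steps above, assuming a hypothetical file name creditcard.csv (the report's actual dataset path may differ):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")   # hypothetical file name

print(df.shape)               # shape of the data frame
print(df.dtypes)              # data type of each column
print(df.duplicated().sum())  # count of duplicate rows
print(df.isnull().sum())      # missing values per column
print(df.nunique())           # unique values per column
```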
1.2.2 SCOPE OF THE PROJECT
The main scope is to detect fraud, which is a classic classification problem, with the help of machine learning algorithms. A model is needed that can differentiate between fraudulent and legitimate transactions.
2. EXISTING SYSTEM:
They proposed a method named the Information-Utilization-Method (INUM). It was first designed, and the accuracy and convergence of an information vector generated by INUM were analyzed. The novelty of INUM is illustrated by comparing it with other methods. Consider two D-vectors (i.e., feature subsets) a and b, where Ai is the ith feature in a data set: they are dissimilar in decision space but correspond to the same O-vector y in objective space. Assume that only a is provided to decision-makers, but a becomes inapplicable due to an accident or other reasons (e.g., difficulty extracting it from the data set). Then decision-makers are in trouble. On the other hand, if both feature subsets are provided to them, they have other choices to serve their best interest. In other words, obtaining more equivalent D-vectors in the decision space provides more chances for decision-makers to ensure that their interests are best served. Therefore, it is of great significance and importance to solve MMOPs with a good Pareto front approximation and also the largest number of D-vectors for each O-vector.
2.1. DISADVANTAGES:
2.2.1 ADVANTAGES:
Performance and accuracy of the algorithms can be calculated and compared
Class imbalance can be dealt with using machine learning approaches
[System architecture: BANK dataset → data processing → training dataset and test dataset → training algorithm → classification ML model]
General
A literature review is a body of text that aims to review the critical points of current knowledge on, and/or methodological approaches to, a particular topic. It draws on secondary sources and discusses published information in a particular subject area, sometimes restricted to a certain time period. Its ultimate goal is to bring the reader up to date with the current literature on a topic. It forms the basis for another goal, such as future research that may be needed in the area, precedes a research proposal, and may be just a simple summary of sources. Usually, it has an organizational pattern and combines both summary and synthesis.

A summary is a recap of important information about the source, but a synthesis is a re-organization, a reshuffling, of information. It might give a new interpretation of old material, combine new with old interpretations, or trace the intellectual progression of the field, including major debates. Depending on the situation, the literature review may evaluate the sources and advise the reader on the most pertinent or relevant of them.
Review of Literature Survey
Companies want to give more and more facilities to their customers. One of these facilities is the online mode of buying goods. Customers can now buy the required goods online, but this is also an opportunity for criminals to commit fraud. Criminals can steal the information of any cardholder and use it for online purchases until the cardholder contacts the bank to block the card. This paper shows the different machine learning algorithms that are used for detecting this kind of transaction. The research shows that credit card fraud (CCF) is a major issue of the financial sector that is increasing with the passage of time. More and more companies are moving towards the online mode that allows customers to make online transactions. This is an opportunity for criminals to steal the information or cards of other persons to make online transactions. The most popular techniques used to steal credit card information are phishing and Trojans, so a fraud detection system is needed to detect such activities.
Nowadays the credit card is popular among both private and public employees. Using the credit card, users purchase consumable and durable products online and also transfer amounts from one account to another. The fraudster detects the details of the user's transaction behavior and carries out illegal activities with the card via phishing, Trojan viruses, etc. The fraudster may threaten users over their sensitive information. In this paper, we have discussed various methods of detecting and controlling fraudulent activities. This will be helpful to improve the security of card transactions in the future. Credit card fraud is one of the major issues people face. Due to these fraudulent activities, many credit card users are losing their money and their sensitive information. In this paper, we have discussed the different fraud detection and control techniques for credit cards, which will also be helpful to improve security against fraudsters in the future and avoid illegal activities.
Year : 2021
Nowadays the credit card transaction is one of the most common modes of financial transaction. The increasing trend of financial transactions through credit cards also invites fraudulent activities that involve the loss of billions of dollars globally. It has also been observed that fraudulent transactions have increased by 35% from 2018. A huge amount of transaction data is available to analyze fraud detection activities, which requires analysis of behavior/abnormalities in the transaction dataset to detect and block the undesirable actions of suspected persons. The proposed paper lists a comprehensive summary of various techniques for the classification of fraudulent transactions from various datasets to alert the user to such transactions. In the last decades, online transactions have grown rapidly and become the most common tool for financial transactions. The increasing growth of online transactions also increases threats. Therefore, keeping in mind the security issues and the anomalous nature of credit card transactions, the proposed work presents a summary of various strategies applied to identify abnormal transactions in credit card transaction datasets. Such a dataset contains a mix of normal and fraudulent transactions; the proposed work classifies and summarizes the various classification methods used to classify the transactions with various machine-learning-based classifiers. The efficiency of each method depends on the dataset and classifier used. The proposed summary will be beneficial to bankers, credit card users, and researchers in analyzing and preventing credit card fraud. The future scope of this credit card fraud detection work is to extend it to every association and bank so that people can live safe and happy lives, with balanced data in each place yielding the best results.
Title : A Review On Credit Card Fraud Detection Using Machine Learning
Author: Suresh K Shirgave, Chetan J. Awati, Rashmi More, Sonam S. Patil
Year : 2019
In recent years credit card fraud has become one of the growing problems. Large financial losses have greatly affected individuals using credit cards, as well as merchants and banks. Machine learning is considered one of the most successful techniques to identify fraud. This paper reviews different fraud detection techniques using machine learning and compares them using performance measures like accuracy, precision and specificity. The paper also proposes a fraud detection system (FDS) which uses the supervised Random Forest algorithm. With this proposed system, the accuracy of detecting fraud in credit cards is increased. Further, the proposed system uses a learning-to-rank approach to rank the alerts, and it also effectively addresses the problem of concept drift in fraud detection. The paper reviewed various machine learning algorithms that detect fraud in credit card transactions. The performances of all these techniques are examined based on accuracy, precision and specificity metrics. We have selected the supervised learning technique Random Forest to classify the alerts as fraudulent or authorized. This classifier will be trained using feedback and delayed supervised samples. Next, it will aggregate each probability to detect alerts. Further, we proposed a learning-to-rank approach where alerts will be ranked based on priority. The suggested method will be able to solve the class imbalance and concept drift problems. Future work will include applying semi-supervised learning methods for the classification of alerts in the FDS.
Title : Credit Card Fraud Detection and Prevention using Machine Learning
Author: S. Abinayaa, H. Sangeetha, R. A. Karthikeyan, K. Saran Sriram, D. Piyush
Year : 2020
evaluated data set and providing the current data set [1]. Finally, the accuracy of the results data is optimized. Then the processing of a number of attributes is implemented, so that the factors affecting fraud detection can be found by viewing the representation of the graphical model. The technique's efficiency is measured based on accuracy, flexibility, specificity and precision. The results obtained with the use of the Random Forest algorithm have proved much more effective.
3. FEASIBILITY STUDY:
Data Wrangling
In this section of the report we load the data, check it for cleanliness, and then trim and clean the given dataset for analysis. We make sure to document the steps carefully and justify the cleaning decisions.
Data collection
The data set collected for prediction is split into a training set and a test set. Generally, a 7:3 ratio is applied to split the training set and test set. The data models created using the Random Forest, Logistic Regression and Decision Tree algorithms and the Support Vector Classifier (SVC) are applied to the training set and, based on the test-result accuracy, the test set prediction is done.
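A minimal sketch of this 7:3 split, assuming a hypothetical creditcard.csv with a "Class" fraud label (the real dataset's schema may differ):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")   # hypothetical file name
X = df.drop(columns=["Class"])       # feature columns (hypothetical label name)
y = df["Class"]                      # fraud label (1 = fraud, 0 = normal)

# 7:3 ratio between training and test sets, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```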
Preprocessing
The data which was collected might contain missing values that may lead to inconsistency. To gain better results, the data needs to be preprocessed so as to improve the efficiency of the algorithm. Outliers have to be removed, and variable conversion needs to be done.
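A short sketch of these steps, assuming numeric features, a hypothetical "Amount" column, and a simple 1.5*IQR rule for outliers (one common choice, not the report's stated method):

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")   # hypothetical file name

# Fill missing numeric values with each column's median
df = df.fillna(df.median(numeric_only=True))

# Drop rows outside 1.5*IQR on a chosen column (hypothetical name)
q1, q3 = df["Amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[(df["Amount"] >= q1 - 1.5 * iqr) & (df["Amount"] <= q3 + 1.5 * iqr)]
```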
unbiased in many tests and it is relatively easy to tune with.
CONSTRUCTION OF A PREDICTIVE MODEL
Machine learning needs the gathering of lots of historical data. Data gathering should yield sufficient historical and raw data. Raw data cannot be used directly; it must be pre-processed before deciding what kind of algorithm and model to use. Training and testing verify that the model works and predicts correctly with minimum errors. The tuned model is improved from time to time to increase accuracy.
Data Gathering → Data Pre-Processing → Choose Model → Train Model → Test Model → Tune Model → Prediction
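A sketch of this flow in scikit-learn, on generated toy data standing in for the gathered transactions; the pipeline stages mirror the steps above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the gathered transactions (illustrative)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                    # data pre-processing
    ("model", LogisticRegression(max_iter=1000)),   # chosen model
])
pipe.fit(X_train, y_train)                          # train model
print(pipe.score(X_test, y_test))                   # test model: prediction accuracy
```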
4. PROJECT REQUIREMENTS
4.1. General:
1. Functional requirements
2. Non-Functional requirements
3. Environment requirements
A. Hardware requirements
B. Software requirements
4.1.2 Functional requirements:
1. Problem definition
2. Preparing data
3. Evaluating algorithms
4. Improving results
5. Predicting the result
1. Software Requirements:
2. Hardware requirements:
RAM : minimum 2 GB
4.3. SOFTWARE DESCRIPTION
Analytics, and it is a Python distribution that comes preinstalled with many useful Python libraries for data science.
Anaconda is a distribution of the Python and R programming
languages for scientific computing (data science, machine learning
applications, large-scale data processing, predictive analytics, etc.), that
aims to simplify package management and deployment.
JupyterLab
Jupyter Notebook
Spyder
PyCharm
VSCode
Glueviz
Orange 3 App
RStudio
Anaconda Prompt (Windows only)
Anaconda PowerShell (Windows only)
Anaconda Navigator is a desktop graphical user interface (GUI)
included in Anaconda distribution.
Navigator allows you to launch common Python programs and easily
manage conda packages, environments, and channels without using
command-line commands. Navigator can search for packages on
Anaconda Cloud or in a local Anaconda Repository.
Anaconda comes with many built-in packages that you can easily list with conda list at your Anaconda prompt. As it has lots of packages (many of which are rarely used), it requires lots of space and time as well. If you have enough space and time and do not want the burden of installing small utilities like JSON and YAML parsers yourself, you should go for Anaconda.
4.3.2.CONDA :
4.3.3 JUPYTER NOTEBOOK:
Project Jupyter is a nonprofit organization created to "develop open-source software, open standards, and services for interactive computing across dozens of programming languages". It was spun off from IPython in 2014 by Fernando Pérez.
This will launch a new browser window (or a new tab) showing the
Notebook Dashboard, a sort of control panel that allows (among other
things) to select which notebook to open.
When started, the Jupyter Notebook App can access only files within its start-up folder (including any sub-folders). No configuration is necessary if you place your notebooks in your home folder or subfolders. Otherwise, you need to choose a Jupyter Notebook App start-up folder which will contain all the notebooks.
Save notebooks: Modifications to the notebooks are automatically saved every few minutes. To avoid modifying the original notebook, make a copy of the notebook document (menu File -> Make a Copy...) and save the modifications in the copy.
Click on the menu Help -> User Interface Tour for an overview of the
Jupyter Notebook App user interface.
You can run the notebook document step by step (one cell at a time) by pressing Shift + Enter.
You can run the whole notebook in a single step by clicking on the menu Cell -> Run All.
To restart the kernel (i.e., the computational engine), click on the menu Kernel -> Restart. This can be useful to start a computation over from scratch (e.g., variables are deleted, open files are closed, etc.).
JUPYTER NOTEBOOK APP:
The Notebook Dashboard has other features similar to a file
manager, namely navigating folders and renaming/deleting files
WORKING PROCESS:
Download and install Anaconda and get the most useful packages for machine learning in Python.
Load a dataset and understand its structure using statistical summaries and data visualization.
Evaluate machine learning models, pick the best and build confidence that the accuracy is reliable.
Python is a popular and powerful interpreted language. Unlike R,
Python is a complete language and platform that you can use for both
research and development and developing production systems. There are
also a lot of modules and libraries to choose from, providing multiple ways
to do each task. It can feel overwhelming.
The best way to get started using Python for machine learning is to
complete a project.
It will force you to install and start the Python interpreter (at the very least).
It will give you a bird's-eye view of how to step through a small project.
It will give you confidence, maybe to go on to your own small projects.
When you are applying machine learning to your own datasets, you are
working on a project. A machine learning project may not be linear, but it
has a number of well-known steps:
Define Problem.
Prepare Data.
Evaluate Algorithms.
Improve Results.
Present Results.
The best way to really come to terms with a new platform or tool is
to work through a machine learning project end-to-end and cover the key
steps. Namely, from loading data, summarizing data, evaluating
algorithms and making some predictions.
4.3.4 PYTHON
INTRODUCTION:
Python 3.0, released in 2008, was a major revision that is not completely backward-compatible. Python 2 was discontinued with version 2.7.18 in 2020.
HISTORY:
Python 2.0 was released on 16 October 2000, with many major new
features, including a cycle-detecting garbage collector and support for
Unicode.
Python 3.9.2 and 3.8.8 were expedited as all versions of Python (including 2.7) had security issues leading to possible remote code execution and web cache poisoning.
Rather than having all of its functionality built into its core, Python
was designed to be highly extensible (with modules). This compact
modularity has made it particularly popular as a means of adding
programmable interfaces to existing applications. Van Rossum's vision of
a small core language with a large standard library and easily extensible
interpreter stemmed from his frustrations with ABC, which espoused the
opposite approach.
SYNTAX AND SEMANTICS :
INDENTATION :
The while statement, which executes a block of code as long as its condition is true.
The try statement, which allows exceptions raised in its attached code
block to be caught and handled by except clauses; it also ensures that
clean-up code in a finally block will always be run regardless of how the
block exits.
The raise statement, used to raise a specified exception or re-raise a
caught exception.
The class statement, which executes a block of code and attaches its
local namespace to a class, for use in object-oriented programming.
The def statement, which defines a function or method.
The with statement, which encloses a code block within a context manager (for example, acquiring a lock before the block of code is run and releasing the lock afterwards, or opening a file and then closing it), allowing resource-acquisition-is-initialization (RAII)-like behavior and replacing a common try/finally idiom.
The break statement, exits from a loop.
The continue statement, skips this iteration and continues with the next
item.
The del statement, removes a variable, which means the reference
from the name to the value is deleted and trying to use that variable
will cause an error. A deleted variable can be reassigned.
The pass statement, which serves as a NOP. It is syntactically needed
to create an empty code block.
The assert statement, used during debugging to check for conditions
that should apply.
The yield statement, which returns a value from a generator function
and yield is also an operator. This form is used to implement co-
routines.
The return statement, used to return a value from a function.
The import statement, which is used to import modules whose
functions or variables can be used in the current program.
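A small illustrative snippet (not from the report) combining several of these statements: try/except, with, for, continue and return:

```python
def read_numbers(path):
    """Read one integer per line from a file, skipping blank lines."""
    values = []
    try:
        with open(path) as fh:       # context manager: file closed automatically
            for line in fh:
                if not line.strip():
                    continue         # skip blank lines
                values.append(int(line))
    except FileNotFoundError:
        print("missing file:", path)
    return values                    # return a value from the function
```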
The assignment statement (=) operates by binding a name as a reference to a separate, dynamically-allocated object. Variables may be subsequently rebound at any time to any object. In Python, a variable name is a generic reference holder and does not have a fixed data type associated with it. However, at a given time, a variable will refer to some object, which will have a type. This is referred to as dynamic typing and is contrasted with statically-typed programming languages, where each variable may only contain values of a certain type.
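A tiny illustration of this rebinding:

```python
x = 42          # the name x refers to an int object
x = "spam"      # the same name now refers to a str object
print(type(x))  # <class 'str'>: the object, not the name, carries the type
```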
EXPRESSIONS :
Python uses the ** operator for exponentiation.
From Python 3.5, the new @ infix operator was introduced. It is
intended to be used by libraries such as NumPy for matrix
multiplication.
From Python 3.8, the syntax :=, called the 'walrus operator' was
introduced. It assigns values to variables as part of a larger expression.
In Python, == compares by value, versus Java, which compares
numerics by value and objects by reference. (Value comparisons in
Java on objects can be performed with the equals() method.) Python's
is operator may be used to compare object identities (comparison by
reference). In Python, comparisons may be chained, for example
A<=B<=C.
Python uses the words and, or, not for its boolean operators rather than the symbolic &&, ||, ! used in Java and C.
Python has a type of expression termed a list comprehension as well as
a more general expression termed a generator expression.
Anonymous functions are implemented using lambda expressions;
however, these are limited in that the body can only be one
expression.
Conditional expressions in Python are written as x if c else y (different
in order of operands from the c ? x : y operator common to many other
languages).
Python makes a distinction between lists and tuples. Lists are written
as [1, 2, 3], are mutable, and cannot be used as the keys of
dictionaries (dictionary keys must be immutable in Python). Tuples are
written as (1, 2, 3), are immutable and thus can be used as the keys of
dictionaries, provided all elements of the tuple are immutable. The +
operator can be used to concatenate two tuples, which does not
directly modify their contents, but rather produces a new tuple
containing the elements of both provided tuples. Thus, given the
variable t initially equal to (1, 2, 3), executing t = t + (4, 5) first
evaluates t + (4, 5), which yields (1, 2, 3, 4, 5), which is then assigned
back to t, thereby effectively "modifying the contents" of t, while
conforming to the immutable nature of tuple objects. Parentheses are
optional for tuples in unambiguous contexts.
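For instance, a short sketch of the tuple re-binding described above:

```python
t = (1, 2, 3)
t = t + (4, 5)   # builds a new tuple; the original (1, 2, 3) is unchanged
print(t)         # (1, 2, 3, 4, 5)
```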
Python features sequence unpacking wherein multiple expressions,
each evaluating to anything that can be assigned to (a variable, a
writable property, etc.), are associated in an identical manner to that
forming tuple literals and, as a whole, are put on the left-hand side of
the equal sign in an assignment statement. The statement expects an
iterable object on the right-hand side of the equal sign that produces
the same number of values as the provided writable expressions when
iterated through and will iterate through it, assigning each of the
produced values to the corresponding expression on the left.
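A few illustrative examples of sequence unpacking:

```python
a, b, c = [1, 2, 3]      # unpack an iterable into three names
a, b = b, a              # swap values via tuple packing/unpacking
first, *rest = range(5)  # extended unpacking: first = 0, rest = [1, 2, 3, 4]
```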
Python has a "string format" operator %. This functions analogously to printf format strings in C, e.g. "spam=%s eggs=%d" % ("blah", 2) evaluates to "spam=blah eggs=2". In Python 3 and 2.6+, this was supplemented by the format() method of the str class, e.g. "spam={0} eggs={1}".format("blah", 2). Python 3.6 added "f-strings": blah = "blah"; eggs = 2; f'spam={blah} eggs={eggs}'.
Strings in Python can be concatenated by "adding" them (using the same operator as for adding integers and floats), e.g. "spam" + "eggs" returns "spameggs". Even if your strings contain numbers, they are still added as strings rather than integers, e.g. "2" + "2" returns "22".
Python has various kinds of string literals:
o Strings delimited by single or double quote marks. Unlike in
Unix shells, Perl and Perl-influenced languages, single quote marks
and double quote marks function identically. Both kinds of string
use the backslash (\) as an escape character. String interpolation
became available in Python 3.6 as "formatted string literals".
o Triple-quoted strings, which begin and end with a series of three
single or double quote marks. They may span multiple lines and
function like here documents in shells, Perl and Ruby.
o Raw strings, denoted by prefixing the string literal with an r. Escape sequences are not interpreted; hence raw strings are useful where literal backslashes are common, such as regular expressions and Windows-style paths. Compare "@-quoting" in C#.
Python has array index and array slicing expressions on lists,
denoted as a[Key], a[start:stop] or a[start:stop:step]. Indexes are zero-
based, and negative indexes are relative to the end. Slices take
elements from the start index up to, but not including, the stop index.
The third slice parameter, called step or stride, allows elements to be
skipped and reversed. Slice indexes may be omitted, for example a[:]
returns a copy of the entire list. Each element of a slice is a shallow
copy.
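An illustrative snippet of these indexing and slicing forms:

```python
a = [0, 1, 2, 3, 4, 5]
print(a[1:4])    # [1, 2, 3]  (start up to, not including, stop)
print(a[-2:])    # [4, 5]     (negative indexes count from the end)
print(a[::2])    # [0, 2, 4]  (step parameter skips elements)
print(a[::-1])   # [5, 4, 3, 2, 1, 0]  (negative step reverses)
b = a[:]         # shallow copy of the entire list
```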
Conditional expressions vs. if blocks
The eval() vs. exec() built-in functions (in Python 2, exec is a
statement); the former is for expressions, the latter is for statements.
Statements cannot be a part of an expression, so list and
other comprehensions or lambda expressions, all being
expressions, cannot contain statements. A particular case of this is that
an assignment statement such as a=1 cannot form part of the conditional
expression of a conditional statement. This has the advantage of avoiding
a classic C error of mistaking an assignment operator = for an equality
operator == in conditions: if (c==1) {…} is syntactically valid (but
probably unintended) C code but if c=1: … causes a syntax error in
Python.
METHODS :
TYPING :
Python uses duck typing and has typed objects but untyped variable names. Type constraints are not checked at compile time; rather, operations on an object may fail, signifying that the given object is not of a suitable type. Despite being dynamically-typed, Python is strongly-typed, forbidding operations that are not well-defined (for example, adding a number to a string) rather than silently attempting to make sense of them.
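For example, the mismatch raises an error instead of being silently coerced:

```python
try:
    result = "1" + 1       # adding a number to a string
except TypeError as exc:
    print(exc)             # Python refuses rather than guessing a meaning
```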
Classes are instances of the metaclass type (itself an instance of itself), allowing meta-programming and reflection.
Before version 3.0, Python had two kinds of classes: old-style and new-style. The syntax of both styles is the same, the difference being whether the class object is inherited from, directly or indirectly (all new-style classes inherit from object and are instances of type). In versions of Python 2 from Python 2.2 onwards, both kinds of classes can be used. Old-style classes were eliminated in Python 3.0.
The long-term plan is to support gradual typing and from Python 3.5, the
syntax of the language allows specifying static types but they are not
checked in the default implementation, CPython. An experimental
optional static type checker named mypy supports compile-time type
checking.
5. SYSTEM DIAGRAMS
5.2 WORK FLOW DIAGRAM
[Workflow diagram: Source Data → Training Dataset and Testing Dataset → Classification ML → Best Model by Accuracy]
62
Use case diagrams are considered for high-level requirement analysis of a system. When the requirements of a system are analyzed, the functionalities are captured in use cases. So it can be said that use cases are nothing but the system functionalities written in an organized manner.
5.4 CLASS DIAGRAM:
A class diagram is basically a graphical representation of the static view of the system and represents different aspects of the application, so a collection of class diagrams represents the whole system. The name of the class diagram should be meaningful and describe the aspect of the system. Each element and their relationships should be identified in advance. The responsibility (attributes and methods) of each class should be clearly identified, and for each class the minimum number of properties should be specified, because unnecessary properties will make the diagram complicated. Use notes whenever required to describe some aspect of the diagram, and at the end of the drawing it should be understandable to the developer/coder. Finally, before making the final version, the diagram should be drawn on plain paper and reworked as many times as possible to make it correct.
5.5 ACTIVITY DIAGRAM:
An activity is a particular operation of the system. Activity diagrams are not only used for visualizing the dynamic nature of a system, but they are also used to construct the executable system by using forward and reverse engineering techniques. The only thing missing in an activity diagram is the message part: it does not show any message flow from one activity to another. An activity diagram is sometimes considered a flow chart; although the diagram looks like a flow chart, it is not. It shows different flows such as parallel, branched, concurrent and single.
5.6 SEQUENCE DIAGRAM:
5.7 ENTITY RELATIONSHIP DIAGRAM (ERD)
6. LIST OF MODULES:
Data Pre-processing
Data Analysis of Visualization
Comparing Algorithms with prediction in the form of the best accuracy result
Deployment Using Flask
Some of these sources are just simple random mistakes. Other times, there can be a deeper reason why data is missing. It's important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how to deal with filling in the missing values, how to detect missing values, and which basic imputation and detailed statistical approaches to use for dealing with missing data. Before jumping into code, it's important to understand the sources of missing data. Here are some typical reasons why data is missing:
Users chose not to fill out a field tied to their beliefs about how the
results would be used or interpreted.
Import libraries for access and functional purposes and read the given dataset
General Properties of Analyzing the given dataset
Display the given dataset in the form of data frame
show columns
shape of the data frame
To describe the data frame
Checking data type and information about dataset
Checking for duplicate data
Checking Missing values of data frame
Checking unique values of data frame
Checking count values of data frame
Rename and drop the given data frame
To specify the type of values
To create extra columns
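A short sketch of the rename/drop/typing steps above; the file and column names here are hypothetical stand-ins for the report's actual schema:

```python
import pandas as pd

df = pd.read_csv("creditcard.csv")                  # hypothetical file name
df = df.rename(columns={"Class": "is_fraud"})       # rename a column
df = df.drop(columns=["Time"])                      # drop an unneeded column
df["amount_int"] = df["Amount"].astype(int)         # specify a type / add a column
```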
MODULE DIAGRAM
6.1.2. DATA VALIDATION/ CLEANING/PREPARING PROCESS
Import the library packages and load the given dataset. Analyze variable identification via the data shape and data types, and evaluate missing and duplicate values. A validation dataset is a sample of data held back from training your model that is used to give an estimate of model skill while tuning the model; there are procedures you can use to make the best use of validation and test datasets when evaluating your models. Data cleaning/preparing involves renaming the given dataset, dropping columns, etc., in order to analyze the uni-variate, bi-variate and multi-variate processes. The steps and techniques for data cleaning will vary from dataset to dataset. The primary goal of data cleaning is to detect and remove errors and anomalies to increase the value of data in analytics and decision making.
Sometimes data does not make sense until it is viewed in a visual form, such as charts and plots. Being able to quickly visualize data samples is an important skill both in applied statistics and in applied machine learning. This section will cover the many types of plots you will need to know when visualizing data in Python and how to use them to better understand your own data.
How to chart time series data with line plots and categorical
quantities with bar charts.
How to summarize data distributions with histograms and box plots.
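The project goals list pairplot, heatmap, bar chart and histogram; a minimal sketch of two of these, assuming seaborn is available alongside matplotlib and a hypothetical creditcard.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("creditcard.csv")        # hypothetical file name

df["Amount"].plot(kind="hist", bins=50)   # distribution via histogram
plt.show()

sns.heatmap(df.corr(numeric_only=True))   # feature correlations via heatmap
plt.show()
```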
MODULE DIAGRAM
FALSE POSITIVES (FP): A person who will pay is predicted as a defaulter. The actual class is no and the predicted class is yes, e.g. the actual class says this passenger did not survive but the predicted class tells you that this passenger will survive.

FALSE NEGATIVES (FN): A person who will default is predicted as a payer. The actual class is yes but the predicted class is no, e.g. the actual class value indicates that this passenger survived but the predicted class tells you the passenger will die.

TRUE POSITIVES (TP): A person who will not pay is correctly predicted as a defaulter. These are the correctly predicted positive values, meaning the value of the actual class is yes and the value of the predicted class is also yes, e.g. the actual class value indicates that this passenger survived and the predicted class tells you the same thing.

TRUE NEGATIVES (TN): A person who will pay is correctly predicted as a payer. These are the correctly predicted negative values, meaning the value of the actual class is no and the value of the predicted class is also no, e.g. the actual class says this passenger did not survive and the predicted class tells you the same thing.
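These four counts can be read off scikit-learn's confusion matrix; a minimal sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1, 0]   # hypothetical actual classes
y_pred = [0, 1, 1, 1, 0, 0]   # hypothetical predicted classes

# scikit-learn returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```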
In the next section you will discover exactly how you can do that in
Python with scikit-learn. The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data, and this can be achieved by forcing each algorithm to be evaluated on a consistent test harness.
Logistic Regression
Random Forest
Decision Tree Classifier
Naive Bayes
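A minimal sketch of such a harness, evaluating the four classifiers above under the same cross-validation scheme (generated toy data stands in for the real transactions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=7)  # toy stand-in data

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    # Same data, same 5-fold split, same metric for every algorithm
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```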
True Positive Rate (TPR) = TP / (TP + FN)
False Positive Rate (FPR) = FP / (FP + TN)
ACCURACY: The proportion of the total number of predictions that are correct; in other words, how often the model correctly predicts defaulters and non-defaulters.

ACCURACY CALCULATION:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar costs. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
General Formula:
F-measure = 2 × (Precision × Recall) / (Precision + Recall)

F1-Score Formula:
F1 = 2TP / (2TP + FP + FN)
Sklearn:
In Python, sklearn is a machine learning package which includes a lot of ML algorithms. Here, we are using some of its modules, such as train_test_split, DecisionTreeClassifier, LogisticRegression and accuracy_score.
NUMPY:
It is a numeric Python module which provides fast math functions for calculations.
It is used to read data into NumPy arrays and for manipulation purposes.
PANDAS:
Used to read and write different files.
Data manipulation can be done easily with data frames.
MATPLOTLIB:
Data visualization is a useful way to help identify patterns in a given dataset.
Plots such as line charts, bar charts, histograms and scatter plots can be created easily.
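Taken together, the stack above is typically pulled in as follows (an illustrative set of imports, not the report's exact code):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```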
6.2.2LOGISTIC REGRESSION
For a binary regression, the factor level 1 of the dependent
variable should represent the desired outcome.
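A minimal sketch on generated toy data; factor level 1 is the positive (desired) outcome whose probability the model estimates:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=1)  # toy stand-in data
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X[:3])[:, 1]   # estimated probability of class 1
print(proba, proba >= 0.5)               # class 1 predicted when p >= 0.5
```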
MODULE DIAGRAM
6.2.3. RANDOM FOREST CLASSIFIER
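A minimal sketch of the random forest classifier on generated toy data; the hyperparameters here are illustrative, not the report's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=2)  # toy data
forest = RandomForestClassifier(n_estimators=100, random_state=2).fit(X, y)

# An ensemble of decision trees votes; feature importances fall out for free
print(forest.feature_importances_)
```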
MODULE DIAGRAM
structure. It utilizes an if-then rule set which is mutually exclusive and exhaustive for classification. The rules are learned sequentially, using the training data one rule at a time. Each time a rule is learned, the tuples covered by the rule are removed. This process continues on the training set until a termination condition is met. The tree is constructed in a top-down, recursive, divide-and-conquer manner. All attributes should be categorical; otherwise, they should be discretized in advance. Attributes at the top of the tree have more impact on the classification, and they are identified using the information gain concept. A decision tree can easily be over-fitted, generating too many branches, and may reflect anomalies due to noise or outliers.
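A small sketch (toy data) that prints the learned if-then rules; limiting max_depth is one guard against the over-fitting noted above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=3)  # toy data
tree = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X, y)

# Print the learned if-then rules; max_depth limits over-fitting
print(export_text(tree))
```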
MODULE DIAGRAM
6.2.5. NAIVE BAYES ALGORITHM:
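A minimal sketch of a Gaussian Naive Bayes classifier on generated toy data (one common Naive Bayes variant; the report does not specify which it used):

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=4)  # toy stand-in data
nb = GaussianNB().fit(X, y)                # fits one Gaussian per class and feature

print(nb.predict(X[:5]))                   # predicted classes for the first rows
print(nb.class_prior_)                     # class priors learned from the labels
```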
MODULE DIAGRAM
7. CODING AND OUTPUT SCREENS
[Code listings and output screenshots from the original report appear here.]
8. CONCLUSION
evaluation. The best accuracy on the public test set, i.e. the highest accuracy score, will be found out. This application can help to find the prediction of credit card fraud.
9. REFERENCES:
access, Oct. 9, 2019, doi: 10.1109/TSMC.2019.2944338.
• K. Deb and S. Tiwari, "Omni-optimizer: A procedure for single and multi-objective optimization," in Proc. 3rd Int. Conf. Evol. Multi-Criterion Optim. (EMO), Mar. 2005, pp. 47–61.
• K. Chan and T. Ray, "An evolutionary algorithm to maintain diversity in the parametric and the objective space," in Proc. Int. Conf. Comput. Robot. Auton. Syst. (CIRAS), 2005, pp. 13–16.
• A. Zhou, Q. Zhang, and Y. Jin, "Approximating the set of Pareto-optimal solutions in both the decision and objective spaces by an estimation of distribution algorithm," IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 1167–1189, Oct. 2009.
• Y. Hu et al., "A self-organizing multimodal multi-objective pigeon-inspired optimization algorithm," Sci. China Inf. Sci., vol. 62, no. 7, Jul. 2019, Art. no. 70206.
• Y. Liu, G. G. Yen, and D. Gong, "A multimodal multiobjective evolutionary algorithm using two-archive and recombination strategies," IEEE Trans. Evol. Comput., vol. 23, no. 4, pp. 660–674, Aug. 2019.
• J. Liang, Q. Guo, C. Yue, B. Qu, and K. Yu, "A self-organizing multiobjective particle swarm optimization algorithm for multimodal multiobjective problems," in Proc. 9th Int. Conf. Advances Swarm Intell. (ICSI), Shanghai, China, Jun. 2018, pp. 550–560.
• Y. Wang, Z. Yang, Y. Guo, J. Zhu, and X. Zhu, "A novel multiobjective competitive swarm optimization algorithm for multi-modal multiobjective problems," in Proc. IEEE Congr. Evol. Comput. (CEC), Jun. 2019, pp. 271–278.
• R. Shi, W. Lin, Q. Lin, Z. Zhu, and J. Chen, "Multimodal multi-objective optimization using a density-based one-by-one update strategy," in Proc. IEEE Congr. Evol. Comput. (CEC), Jun. 2019, pp. 295–301.
• W. Zhang, G. Li, W. Zhang, J. Liang, and G. G. Yen, "A cluster-based PSO with leader updating mechanism and ring-topology for multimodal multi-objective optimization," Swarm Evol. Comput., vol. 50, Nov. 2019, Art. no. 100569.
• R. Tanabe and H. Ishibuchi, "A framework to handle multimodal multiobjective optimization in decomposition-based evolutionary algorithms," IEEE Trans. Evol. Comput., vol. 24, no. 4, pp. 720–734, Aug. 2020.
• J. Sun, S. Gao, H. Dai, J. Cheng, M. Zhou, and J. Wang, "Bi-objective elite differential evolution algorithm for multivalued logic networks," IEEE Trans. Cybern., vol. 50, no. 1, pp. 233–246, Jan. 2020.
• W. Gu, Y. Yu, and W. Hu, "Artificial bee colony algorithm-based parameter estimation of fractional-order chaotic system with time delay," IEEE/CAA J. Automatica Sinica, vol. 4, no. 1, pp. 107–113, Jan. 2017.
• G. Wu, X. Shen, H. Li, H. Chen, A. Lin, and P. N. Suganthan, "Ensemble of differential evolution variants," Inf. Sci., vol. 423, pp. 172–186, Jan. 2018.
• J. Zhang and A. C. Sanderson, "Self-adaptive multi-objective differential evolution with direction information provided by archived inferior solutions," in Proc. IEEE Congr. Evol. Comput. (World Congr. Comput. Intell.), Jun. 2008, pp. 2801–2810.
• X. Qiu, J.-X. Xu, K. C. Tan, and H. A. Abbass, "Adaptive cross-generation differential evolution operators for multiobjective optimization," IEEE Trans. Evol. Comput., vol. 20, no. 2, pp. 232–244, Apr. 2016.
• J. J. Liang, B. Y. Qu, D. W. Gong, and C. T. Yue, "Problem definitions and evaluation criteria for the CEC 2019 special session on multimodal multiobjective optimization," Comput. Intell. Lab., Zhengzhou Univ., Zhengzhou, China, Tech. Rep., 2019.
• C. Yue, B. Qu, K. Yu, J. Liang, and X. Li, "A novel scalable test problem suite for multimodal multiobjective optimization," Swarm Evol. Comput., vol. 48, pp. 62–71, Aug. 2019.
• S. Gao, M. Zhou, Y. Wang, J. Cheng, H. Yachi, and J. Wang, "Dendritic neuron model with effective learning algorithms for classification, approximation, and prediction," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 2, pp. 601–614, Feb. 2019.
• L. Zheng, G. Liu, C. Yan, and C. Jiang, "Transaction fraud detection based on total order relation and behavior diversity," IEEE Trans. Comput. Social Syst., vol. 5, no. 3, pp. 796–806, Sep. 2018.