Credit Card Fraud Detection
BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE & ENGINEERING
Submitted to:
Mr. Vijay Kumar Sharma
(Department of Computer Science & Engineering)
Submitted by:
KUSHAGRA 1900680100170
ASHISH JAIN 1900680100079
NITIKA TYAGI 1900680100228
5TH SEMESTER
DECLARATION
We hereby declare that the project entitled "CREDIT CARD FRAUD DETECTION", which is being submitted as a Mini Project to the Department of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut (U.P.), is an authentic record of our genuine work done under the guidance of Mr. Vijay Kumar Sharma, Department of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut.
DATE:
KUSHAGRA (1900680100170)
ASHISH JAIN (1900680100079)
NITIKA TYAGI (1900680100228)
CERTIFICATE
This is to certify that the mini project report entitled "CREDIT CARD FRAUD DETECTION", submitted by KUSHAGRA, ASHISH JAIN and NITIKA TYAGI, has been carried out under the guidance of Mr. Vijay Kumar Sharma, Department of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut. This project report is approved for the Mini Project (KCS-354) in the 5th semester of Computer Science and Engineering at Meerut Institute of Engineering and Technology, Meerut.
Internal Examiner
ACKNOWLEDGEMENT
I express my sincere indebtedness towards our guide, Mr. Vijay Kumar Sharma, Department of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut, for his valuable suggestions, guidance and supervision throughout the work. Without his kind patronage and guidance, the project would not have taken shape. I would also like to express my gratitude and sincere regards for his kind approval of the project and for his time-to-time counseling and advice. I would also like to thank our HOD, Dr. (Prof.) M.I.H. Ansari, Department of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut, for his expert advice and counseling from time to time.
I owe sincere thanks to all the faculty members of the Department of Computer Science and Engineering for their kind guidance and encouragement from time to time.
DATE:
KUSHAGRA (1900680100170)
TABLE OF CONTENTS
Declaration
Certificate
Acknowledgement
Chapter 1: Introduction
Chapter 2: Technology Bucket
Chapter 3: Output Screens
References
CHAPTER 1
INTRODUCTION
The source code for credit card fraud detection in Python is given below.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#loading the dataset into a pandas DataFrame (file name assumed; the Kaggle credit card fraud dataset)
credit_card_data=pd.read_csv('creditcard.csv')
#dataset information
credit_card_data.info()
#checking the number of missing values in each column
credit_card_data.isnull().sum()
#distribution of legitimate and fraudulent transactions
credit_card_data['Class'].value_counts()
#separating the data for analysis
legit=credit_card_data[credit_card_data.Class == 0]
fraud=credit_card_data[credit_card_data.Class == 1]
print(legit.shape)
print(fraud.shape)
#statistical measures of the data
legit.Amount.describe()
fraud.Amount.describe()
#compare the mean values for both transaction classes
credit_card_data.groupby('Class').mean()
#under-sampling: draw 385 legitimate transactions to match the 385 frauds
legit_sample=legit.sample(n=385)
#concatenating the two data frames
new_dataset=pd.concat([legit_sample,fraud],axis=0)
new_dataset.head()
new_dataset.tail()
new_dataset['Class'].value_counts()
credit_card_data.groupby('Class').mean()
X=new_dataset.drop(columns='Class',axis=1)
Y=new_dataset['Class']
print(X)
print(Y)
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
print(X.shape, X_train.shape, X_test.shape)
model=LogisticRegression()
#training the Logistic Regression model with the training data
model.fit(X_train, Y_train)
#accuracy on training data
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(X_train_prediction,Y_train)
print('Accuracy on training data : ',training_data_accuracy)
#accuracy on test data
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(X_test_prediction,Y_test)
print('Accuracy score on Test Data : ',test_data_accuracy)
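Once trained, the model can score individual transactions. The following is an illustrative sketch, not part of the original project code; it assumes the variables defined above have been created and uses predict_proba to obtain a fraud probability:
#ILLUSTRATIVE SKETCH (not in the original project code)
#score one transaction from the test set; assumes the code above has run
sample = X_test.iloc[[0]]                      # a single row, kept as a DataFrame
predicted_class = model.predict(sample)[0]     # 0 = legitimate, 1 = fraud
fraud_probability = model.predict_proba(sample)[0][1]
print('Predicted class:', predicted_class)
print('Estimated fraud probability:', fraud_probability)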
CHAPTER 2
TECHNOLOGY BUCKET
Python is an interpreted, high-level, general-purpose programming language. Its design philosophy emphasizes code readability with its use of significant indentation. Its language constructs, as well as its object-oriented approach, aim to help programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
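As a small, illustrative sketch (not taken from the project) of dynamic typing and the "batteries included" standard library:
import statistics

#a name can be rebound to values of different types (dynamic typing)
x = 42
x = "now a string"
print(type(x))                         # <class 'str'>

#the standard library covers many common tasks without extra packages
print(statistics.mean([1, 2, 3, 4]))   # 2.5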
The project imports the following libraries at the top of the source code: numpy (numerical arrays), pandas (tabular data handling) and scikit-learn (model training and evaluation).
HARDWARE: Laptop
SOFTWARE: Python IDLE
CHAPTER 3
OUTPUT SCREENS
#first five rows of the data set
credit_card_data.head()
#last five rows of the data set
credit_card_data.tail()
#dataset information
credit_card_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200503 entries, 0 to 200502
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 200503 non-null float64
1 V1 200503 non-null float64
2 V2 200503 non-null float64
3 V3 200503 non-null float64
4 V4 200503 non-null float64
5 V5 200503 non-null float64
6 V6 200503 non-null float64
7 V7 200503 non-null float64
8 V8 200503 non-null float64
9 V9 200503 non-null float64
10 V10 200503 non-null float64
11 V11 200503 non-null float64
12 V12 200503 non-null float64
13 V13 200503 non-null float64
14 V14 200503 non-null float64
15 V15 200503 non-null float64
16 V16 200503 non-null float64
17 V17 200503 non-null float64
18 V18 200503 non-null float64
19 V19 200503 non-null float64
20 V20 200503 non-null float64
21 V21 200503 non-null float64
22 V22 200502 non-null float64
23 V23 200502 non-null float64
24 V24 200502 non-null float64
25 V25 200502 non-null float64
26 V26 200502 non-null float64
27 V27 200502 non-null float64
28 V28 200502 non-null float64
29 Amount 200502 non-null float64
30 Class 200502 non-null float64
dtypes: float64(31)
memory usage: 47.4 MB
#checking the number of missing values in each column
credit_card_data.isnull().sum()
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 1
V23 1
V24 1
V25 1
V26 1
V27 1
V28 1
Amount 1
Class 1
dtype: int64
#distribution of legitimate and fraudulent transactions
credit_card_data['Class'].value_counts()
0.0 200117
1.0 385
Name: Class, dtype: int64
#separating the data for analysis
legit=credit_card_data[credit_card_data.Class == 0]
fraud=credit_card_data[credit_card_data.Class == 1]
print(legit.shape)
print(fraud.shape)
(200117, 31)
(385, 31)
#statistical measures of the data
legit.Amount.describe()
count 200117.000000
mean 89.657607
std 248.525731
min 0.000000
25% 5.990000
50% 23.000000
75% 79.150000
max 19656.530000
Name: Amount, dtype: float64
fraud.Amount.describe()
count 385.000000
mean 121.808805
std 256.061414
min 0.000000
25% 1.000000
50% 12.310000
75% 104.810000
max 2125.870000
Name: Amount, dtype: float64
#compare the mean values for both transaction classes
credit_card_data.groupby('Class').mean()
Under-Sampling
Build a sample dataset containing a similar distribution of normal and fraudulent transactions.
#under-sampling: draw 385 legitimate transactions to match the 385 frauds
legit_sample=legit.sample(n=385)
#concatenating the two data frames
new_dataset=pd.concat([legit_sample,fraud],axis=0)
new_dataset.head()
new_dataset.tail()
new_dataset['Class'].value_counts()
1.0 385
0.0 385
Name: Class, dtype: int64
credit_card_data.groupby('Class').mean()
X=new_dataset.drop(columns='Class',axis=1)
Y=new_dataset['Class']
print(X)
Time V1 V2 ... V27 V28 Amount
187479 127554.0 -1.680526 -0.959150 ... 0.231809 0.253554 680.91
10642 17894.0 1.065329 0.164677 ... -0.004638 0.030949 38.03
17818 28940.0 0.900371 -0.101406 ... 0.118023 0.014876 1.52
17122 28454.0 -0.389789 1.196861 ... -0.122716 0.127077 1.00
111478 72243.0 -0.196299 1.309606 ... 0.332336 0.146552 48.72
... ... ... ... ... ... ... ...
192687 129808.0 1.522080 -0.519429 ... -0.015551 0.041881 276.17
195383 131024.0 0.469750 -1.237555 ... -0.117858 0.144774 723.21
197586 132086.0 -0.361428 1.133472 ... -0.001250 -0.182751 480.72
198868 132688.0 0.432554 1.861373 ... 0.387039 0.319402 1.00
199896 133184.0 -1.212682 -2.484824 ... 0.212663 0.431095 1335.00
print(Y)
187479 0.0
10642 0.0
17818 0.0
17122 0.0
111478 0.0
...
192687 1.0
195383 1.0
197586 1.0
198868 1.0
199896 1.0
Name: Class, Length: 770, dtype: float64
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)
print(X.shape, X_train.shape, X_test.shape)
(770, 30) (616, 30) (154, 30)
Model Training
Logistic Regression
model=LogisticRegression()
#training the Logistic Regression model with the training data
model.fit(X_train, Y_train)
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
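The warning above recommends increasing max_iter or scaling the data. As an illustrative sketch (not part of the original notebook), either remedy could be applied as follows; StandardScaler and max_iter=1000 are assumptions here, not the report's settings:
from sklearn.preprocessing import StandardScaler

#scale features to zero mean and unit variance, which helps lbfgs converge
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#alternatively (or additionally), allow the solver more iterations
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, Y_train)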
Evaluation of the Model
Accuracy Score
#accuracy on training data
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(X_train_prediction,Y_train)
print('Accuracy on training data : ',training_data_accuracy)
Accuracy on training data : 0.9383116883116883
#accuracy on test data
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(X_test_prediction,Y_test)
print('Accuracy score on Test Data : ',test_data_accuracy)
Accuracy score on Test Data :  0.9025974025974026
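Accuracy alone can be misleading for fraud detection, since missing a fraud matters more than misclassifying a legitimate transaction. As an illustrative extension (not part of the original notebook), the test predictions computed above could also be summarised with a confusion matrix and per-class precision and recall:
from sklearn.metrics import classification_report, confusion_matrix

#rows are actual classes, columns are predicted classes
print(confusion_matrix(Y_test, X_test_prediction))
#per-class precision, recall and F1-score
print(classification_report(Y_test, X_test_prediction))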
REFERENCES
1. Google
2. GitHub
3. Kaggle