0% found this document useful (0 votes)

9 views39 pages

Credit Card-Fraud-Detection

The document provides an overview of a Python 3 environment set up for data analysis, specifically focusing on a credit card fraud dataset. It includes instructions for loading libraries, accessing data files, and examining the dataset's structure and statistics. The dataset consists of numerical features derived from PCA transformations, with 'Time' and 'Amount' as the only original features, and 'Class' indicating fraud occurrences.

Uploaded by

randhir931024

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views39 pages

Credit Card-Fraud-Detection

Uploaded by

randhir931024

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 39

# This Python 3 environment comes with many helpful analytics

libraries installed
# It is defined by the kaggle/python Docker image:
https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
# Input data files are available in the read-only "../input/"
directory
# For example, running this (by clicking run or pressing Shift+Enter)
will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/)

that gets preserved as output when you create a version using "Save &
Run All"
# You can also write temporary files to /kaggle/temp/, but they won't
be saved outside of the current session

/kaggle/input/creditcardfraud/creditcard.csv

import matplotlib.pyplot as plt

import seaborn as sns
import plotly.express as px

df=pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')

cdf=df.copy()

df.sample(10)

Time V1 V2 V3 V4 V5
V6 \
157897 110593.0 -0.184771 1.108228 -0.000420 -0.407888 1.116536 -
1.256184
278441 168224.0 1.551207 -0.899886 -3.103947 0.029130 0.882554 -
0.751118
178316 123580.0 -0.226187 1.396738 -0.766720 0.839888 1.009486 -
0.218086
173828 121657.0 2.061814 -0.002162 -1.047232 0.412224 -0.088962 -
1.203159
181882 125160.0 2.411499 -0.945026 -2.179706 -1.663150 -0.141163 -
1.013506
249600 154489.0 2.189191 -0.673278 -1.421533 -1.095467 -0.245381 -
0.695507
186741 127236.0 -0.037430 1.258773 0.874234 2.963097 1.008588
0.033458
130582 79385.0 1.073372 0.155477 -0.175165 0.711253 1.042201
1.588291
178615 123706.0 1.654301 -0.217762 -2.065709 1.432106 0.516326 -
0.495317
131258 79536.0 -1.333894 1.325309 1.691594 -0.018299 -0.444247 -
0.446008

V7 V8 V9 ... V21 V22

V23 \
157897 1.621746 -0.591068 0.600826 ... 0.085034 0.670869 -
0.393378
278441 1.243737 -0.472464 -0.110084 ... 0.482927 0.839430 -
0.496715
178316 1.626510 -0.468615 -0.391006 ... 0.185604 0.928662 -
0.094267
173828 0.230400 -0.373535 0.439906 ... -0.280900 -0.642920
0.333149
181882 -0.193541 -0.561755 -1.988911 ... -0.057992 0.438085 -
0.019820
249600 -0.399127 -0.246410 -0.944722 ... 0.492110 1.409953
0.002608
186741 0.972456 -0.058097 -1.438243 ... 0.157799 0.583486 -
0.193589
130582 -0.016883 0.340067 -0.177211 ... 0.120799 0.562449 -
0.111037
178615 0.482452 -0.075117 0.126113 ... -0.047101 -0.475926
0.052558
131258 0.471467 0.416722 -0.756810 ... -0.182823 -0.615566
0.025398

V24 V25 V26 V27 V28 Amount

Class
157897 0.024006 0.307905 0.272145 0.056828 0.127693 23.40
0
278441 0.325256 0.665988 0.408737 -0.168670 -0.046225 310.00
0
178316 0.640909 -0.742096 -0.638334 -0.528740 -0.261750 59.60
0
173828 0.064582 -0.277073 0.193128 -0.066396 -0.058382 1.98
0
181882 0.537304 0.490309 0.089883 -0.042105 -0.071084 10.00
0
249600 0.849445 0.218507 0.009599 -0.042008 -0.066213 15.00
0
186741 -0.098863 -0.070838 0.026904 0.049372 0.046189 7.57
0
130582 -1.754190 0.520497 -0.154572 0.088233 0.003886 27.30
0
178615 0.533127 0.019823 -0.796977 -0.032667 -0.001254 169.42
0
131258 0.398609 0.031015 0.080103 0.032653 0.058260 45.32
0

[10 rows x 31 columns]

Info Provided by European cardholders Dataset Genrator:

It contains only numerical input variables which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features and more
background information about the data. Features V1, V2, … V28 are the principal components
obtained with PCA, the only features which have not been transformed with PCA are 'Time' and
'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first
transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be
used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and
it takes value 1 in case of fraud and 0 otherwise.

df.describe()

Time V1 V2 V3
V4 \
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05
2.848070e+05
mean 94813.859575 1.168375e-15 3.416908e-16 -1.379537e-15
2.074095e-15
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00
1.415869e+00
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -
5.683171e+00
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -
8.486401e-01
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -
1.984653e-02
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00
7.433413e-01
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00
1.687534e+01

V5 V6 V7 V8
V9 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
2.848070e+05
mean 9.604066e-16 1.487313e-15 -5.556467e-16 1.213481e-16 -
2.406331e-15
std 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00
1.098632e+00
min -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -
1.343407e+01
25% -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -
6.430976e-01
50% -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -
5.142873e-02
75% 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01
5.971390e-01
max 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01
1.559499e+01

... V21 V22 V23 V24 \

count ... 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean ... 1.654067e-16 -3.568593e-16 2.578648e-16 4.473266e-15
std ... 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01
min ... -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00
25% ... -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01
50% ... -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02
75% ... 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01
max ... 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00

V25 V26 V27 V28

Amount \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
284807.000000
mean 5.340915e-16 1.683437e-15 -3.660091e-16 -1.227390e-16
88.349619
std 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01
250.120109
min -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01
0.000000
25% -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02
5.600000
50% 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02
22.000000
75% 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02
77.165000
max 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01
25691.160000

Class
count 284807.000000
mean 0.001727
std 0.041527
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

df.isnull().sum()

Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64

There is no missing value so we can proced futher for more analysis.

# Percentage of non-fraudulent data in our dataset

print(f"Non-Fraudulent Transcations:
{round((df[df['Class']==0].shape[0]/df.shape[0])*100,3)}%")
# Percentage of fraudulent data in our dataset
print(f"Fraudulent Transcations:
{round((df[df['Class']==1].shape[0]/df.shape[0])*100,3)}%")

Non-Fraudulent Transcations: 99.827%

Fraudulent Transcations: 0.173%

fig,ax=plt.subplots(1,2,figsize=(18,5))
sns.histplot(df['Time'],kde=True,color='#0f85d6',ax=ax[0])
ax[0].set_title('Time Distribution')
ax[0].set_xlim(df['Time'].min(),df['Time'].max())

sns.histplot(df['Amount'],kde=True,ax=ax[1],color='#ef6810')
ax[1].set_title("Amount Distibution")
ax[1].set_xlim(df['Amount'].min(),df['Amount'].max())
plt.show()

Time Distribution is normal in a time period but the distibution of Amount is too right skew due
to large difference between min,max value of feature Amount.confirm is it using boxplot

fig,ax=plt.subplots(1,2,figsize=(18,5))
sns.boxplot(df[['Time','Amount']],ax=ax[0],palette=['#ef6810','#ef103f
'])
ax[0].set_title('Box Plot of Time & Amount')
sns.boxplot(x='Class',y='Amount',data=df,ax=ax[1])
ax[1].set_title('Box Plot Between Class Vs Amount')

plt.show()

The above box plot suggest that feature Time has no outliers and feature Amount has lot's of
Outliers and second box plot suggest that Class 0 has more outliers then class 1 and range from

We have given that from V1 to V28 features are scaled so we also scaled the Time and Amount
feature for futher analysis using RobustScaler if you are curious about why we select
RobustScaler instead of Standerscaler because it use IQR ,method which is robust to extreme
(outliers) values

'X(normal) = X−median(X)/IQR(x)
from sklearn.preprocessing import RobustScaler
scaler=RobustScaler()
df['Time_Scaled']=scaler.fit_transform(df['Time'].values.reshape(-
1,1))
df['Amount_Scaled']=scaler.fit_transform(df['Amount'].values.reshape(-
1,1))
# Drop Time and Amount Feature from dataset as we add Time_Scaled and
Amount_Scaled in DataFrame
df.drop(columns=['Time','Amount'],inplace=True,axis=1)

df.head()

V1 V2 V3 V4 V5 V6
V7 \
0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388
0.239599
1 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -
0.078803
2 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499
0.791461
3 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203
0.237609
4 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921
0.592941

V8 V9 V10 ... V22 V23 V24

V25 \
0 0.098698 0.363787 0.090794 ... 0.277838 -0.110474 0.066928
0.128539
1 0.085102 -0.255425 -0.166974 ... -0.638672 0.101288 -0.339846
0.167170
2 0.247676 -1.514654 0.207643 ... 0.771679 0.909412 -0.689281 -
0.327642
3 0.377436 -1.387024 -0.054952 ... 0.005274 -0.190321 -1.175575
0.647376
4 -0.270533 0.817739 0.753074 ... 0.798278 -0.137458 0.141267 -
0.206010

V26 V27 V28 Class Time_Scaled Amount_Scaled

0 -0.189115 0.133558 -0.021053 0 -0.994983 1.783274
1 0.125895 -0.008983 0.014724 0 -0.994983 -0.269825
2 -0.139097 -0.055353 -0.059752 0 -0.994972 4.983721
3 -0.221929 0.062723 0.061458 0 -0.994972 1.418291
4 0.502292 0.219422 0.215153 0 -0.994960 0.670579

[5 rows x 31 columns]

fig,ax=plt.subplots(1,2,figsize=(18,5))
sns.histplot(df['Time_Scaled'],kde=True,color='#0f85d6',ax=ax[0])
ax[0].set_title('Scaled Time Distribution')
ax[0].set_xlim(df['Time_Scaled'].min(),df['Time_Scaled'].max())

sns.histplot(df['Amount_Scaled'],kde=True,ax=ax[1],color='#ef6810')
ax[1].set_title(" Scaled Amount Distibution")
ax[1].set_xlim(df['Amount_Scaled'].min(),df['Amount_Scaled'].max())

plt.show()

Distribution remain same cause RobustScaler doesn't affect the normal distributed data. Time
feature is normal distributed so Time_Scaled remain normal distributed. RobustScaler also not
change the skewness from data but reduce the impact of outliers which we can see in our
Amount_Scaled distribution

from sklearn.model_selection import StratifiedKFold

As we know out dataset contain 99.827% of Non-Fraudulent Data and 0.173% of Fraudulent
Data and for futher analysis we want to split into train,test but normal train_test_split method
doesn't work here so we import StratifiedKFold to split Imbalance data

X=df.drop(columns=['Class'])
y=df['Class']
skf=StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
print("Train Index:", train_index, "Test Index:", test_index)
x_train,x_test=X.iloc[train_index],X.iloc[test_index]
y_train,y_test=y.iloc[train_index],y.iloc[test_index]
#y_train,y_test

Train Index: [ 30473 30496 31002 ... 284804 284805 284806] Test
Index: [ 0 1 2 ... 57017 57018 57019]
Train Index: [ 0 1 2 ... 284804 284805 284806] Test
Index: [ 30473 30496 31002 ... 113964 113965 113966]
Train Index: [ 0 1 2 ... 284804 284805 284806] Test
Index: [ 81609 82400 83053 ... 170946 170947 170948]
Train Index: [ 0 1 2 ... 284804 284805 284806] Test
Index: [150654 150660 150661 ... 227866 227867 227868]
Train Index: [ 0 1 2 ... 227866 227867 227868] Test
Index: [212516 212644 213092 ... 284804 284805 284806]

# let's confirm how much percentage each category fall in train and
test data
#For y_train
print(f"Non-Fraudulent Transcations in y_train:
{round(((y_train.values==0).sum()/y_train.shape[0])*100,3)}%")
#Percentage of fraudulent data in y_train
print(f"Fraudulent Transcations in y_train:
{round(((y_train.values==1).sum()/y_train.shape[0])*100,3)}%")
print()
#For y_test
print(f"Non-Fraudulent Transcations in y_test:
{round(((y_test.values==0).sum()/y_test.shape[0])*100,3)}%")
#Percentage of fraudulent data in y_test
print(f"Fraudulent Transcations in y_train:
{round(((y_test.values==1).sum()/y_test.shape[0])*100,3)}%")

Non-Fraudulent Transcations in y_train: 99.827%

Fraudulent Transcations in y_train: 0.173%

Non-Fraudulent Transcations in y_test: 99.828%

Fraudulent Transcations in y_train: 0.172%

Data split is done and we can confirm that there is almost equal percentage of Fraudulent data
in y_train and y_test so we don't have to worry about Model learning problem due to absent of
minor data percentage thanks to StratifiedKFold()

df.shape

(284807, 31)

Random Undersampling

# Let's Do Random Undersampling to train the ML models cause our

dataset is imbalanced
#before preceding to Randomsample make sure your data is suffled so
you get suffled Non-Fraudulent for better result
df=df.sample(284807)
Fraud_Data=df[df['Class']==1]
Non_Fraud_Data=df[df['Class']==0][:492]
new_df=pd.concat([Fraud_Data,Non_Fraud_Data])
new_df=new_df.sample(n=new_df.shape[0],random_state=42)

new_df.sample(5)
V1 V2 V3 V4 V5 V6
V7 \
241445 -3.818214 2.551338 -4.759158 1.636967 -1.167900 -1.678413 -
3.144732
235616 0.218810 2.715855 -5.111658 6.310661 -0.848345 -0.882446 -
2.902079
197185 1.239276 -2.212808 -2.393390 -0.099816 1.242783 4.061149 -
0.709314
189701 -4.599447 2.762540 -4.656530 5.201403 -2.470388 -0.357618 -
3.767189
223915 1.795730 -1.502166 -2.715497 -1.538789 1.459905 3.469648 -
1.305134

V8 V9 V10 ... V22 V23

V24 \
241445 1.245106 -1.692541 -4.759931 ... 0.761712 -0.417694 -
0.469712
235616 0.939162 -3.627698 -1.873331 ... 1.037324 0.062325
0.532490
197185 0.854884 -0.473702 0.824698 ... -0.619283 -0.074234
0.651608
189701 0.061466 -1.836200 -1.470645 ... 0.261333 0.621415
0.994110
223915 0.897601 -0.115284 0.019465 ... 0.572347 -0.006897
0.612803

V25 V26 V27 V28 Class Time_Scaled \

241445 -0.225934 0.586415 -0.348107 0.087777 1 0.779344
235616 -0.149145 0.639580 0.351568 -0.001817 1 0.749257
197185 -0.287183 -0.696606 0.006652 0.038092 0 0.554436
189701 -0.687853 -0.337531 -1.612791 1.231425 1 0.514891
223915 -0.174634 -0.085343 0.016231 0.011851 0 0.691761

Amount_Scaled
241445 -0.157898
235616 -0.296793
197185 5.784951
189701 0.996996
223915 2.333543

[5 rows x 31 columns]

# Checking Correlation of different feature of new_df how they affect

each other
corr=new_df.corr()
corr

V1 V2 V3 V4 V5
V6 \
V1 1.000000 -0.805935 0.880538 -0.608230 0.859491
0.332333
V2 -0.805935 1.000000 -0.861414 0.675716 -0.808640 -
0.283880
V3 0.880538 -0.861414 1.000000 -0.768327 0.852098
0.472096
V4 -0.608230 0.675716 -0.768327 1.000000 -0.579329 -
0.437202
V5 0.859491 -0.808640 0.852098 -0.579329 1.000000
0.309079
V6 0.332333 -0.283880 0.472096 -0.437202 0.309079
1.000000
V7 0.887652 -0.835954 0.891257 -0.712186 0.840496
0.287024
V8 -0.083680 -0.026762 -0.169888 0.099635 -0.203067 -
0.559226
V9 0.653440 -0.693041 0.768250 -0.793418 0.663338
0.380462
V10 0.733944 -0.762334 0.855893 -0.799664 0.755953
0.430772
V11 -0.523940 0.621186 -0.722307 0.796607 -0.528512 -
0.501937
V12 0.591359 -0.665715 0.761600 -0.833658 0.619884
0.502880
V13 -0.046324 0.021022 -0.065471 0.054192 -0.110624 -
0.091394
V14 0.438897 -0.561305 0.658645 -0.798618 0.435106
0.540010
V15 0.125115 -0.177649 0.150661 -0.148488 0.097946 -
0.060266
V16 0.634484 -0.633820 0.727486 -0.731792 0.694026
0.435757
V17 0.675110 -0.640208 0.738870 -0.712483 0.747308
0.431248
V18 0.676987 -0.612440 0.701647 -0.643889 0.744391
0.361393
V19 -0.301488 0.211352 -0.319109 0.317646 -0.399834 -
0.218701
V20 -0.326337 0.349376 -0.372206 0.303630 -0.329809 -
0.123561
V21 0.014390 0.044777 0.024170 -0.019212 0.040931
0.023638
V22 -0.033194 0.000708 -0.065118 0.139337 -0.094804 -
0.000066
V23 -0.033872 0.172902 -0.032930 0.021456 -0.093423
0.317014
V24 -0.082214 0.000592 -0.026072 -0.018264 -0.142346 -
0.075999
V25 -0.086378 0.151777 -0.115695 -0.007283 -0.114951 -
0.109415
V26 0.041588 0.015745 -0.020443 0.154458 0.044032 -
0.071080
V27 0.182976 -0.173336 0.106629 -0.025870 0.196096 -
0.167733
V28 0.210921 0.070419 0.131632 -0.092189 0.145784 -
0.018021
Class -0.429169 0.480641 -0.572221 0.712172 -0.379339 -
0.415159
Time_Scaled 0.233966 -0.205670 0.141509 -0.182244 0.267611
0.093709
Amount_Scaled -0.025660 -0.188290 -0.015966 0.012432 -0.117656
0.143857

V7 V8 V9 V10 ...
V22 \
V1 0.887652 -0.083680 0.653440 0.733944 ... -0.033194

V2 -0.835954 -0.026762 -0.693041 -0.762334 ... 0.000708

V3 0.891257 -0.169888 0.768250 0.855893 ... -0.065118

V4 -0.712186 0.099635 -0.793418 -0.799664 ... 0.139337

V5 0.840496 -0.203067 0.663338 0.755953 ... -0.094804

V6 0.287024 -0.559226 0.380462 0.430772 ... -0.000066

V7 1.000000 0.089940 0.767442 0.865092 ... -0.121329

V8 0.089940 1.000000 -0.070890 -0.048285 ... 0.041478

V9 0.767442 -0.070890 1.000000 0.857070 ... -0.251200

V10 0.865092 -0.048285 0.857070 1.000000 ... -0.220788

V11 -0.639532 0.165150 -0.700504 -0.803238 ... 0.030213

V12 0.719924 -0.161743 0.768951 0.880217 ... -0.117970

V13 -0.022346 0.259218 -0.030458 -0.046796 ... 0.040720

V14 0.542494 -0.181565 0.681104 0.757343 ... 0.054871

V15 0.185078 0.140966 0.134018 0.162335 ... -0.076319

V16 0.745471 -0.168061 0.732159 0.856352 ... -0.108700

V17 0.769965 -0.216994 0.762331 0.852113 ... -0.130152

V18 0.762279 -0.178820 0.709138 0.800112 ... -0.126916

V19 -0.350501 0.212165 -0.324777 -0.412760 ... 0.131963

V20 -0.394137 -0.027319 -0.386982 -0.380672 ... 0.453897

V21 0.037287 -0.126128 0.158199 0.079822 ... -0.741836

V22 -0.121329 0.041478 -0.251200 -0.220788 ... 1.000000

V23 -0.087151 -0.429228 -0.051086 -0.056170 ... 0.006539

V24 -0.067926 0.078185 -0.013090 -0.042476 ... 0.052552

V25 0.034442 0.236750 -0.021308 0.007749 ... -0.239406

V26 0.014072 0.050419 -0.133493 -0.043234 ... 0.026811

V27 0.237781 0.288612 0.147273 0.168796 ... -0.377779

V28 0.165772 -0.017866 0.139660 0.141032 ... -0.233833

Class -0.475549 0.051612 -0.562707 -0.629284 ... 0.021711

Time_Scaled 0.206273 -0.139637 0.148406 0.200110 ... 0.121564

Amount_Scaled 0.104009 0.030666 0.012622 -0.015371 ... -0.019122

V23 V24 V25 V26 V27

V28 \
V1 -0.033872 -0.082214 -0.086378 0.041588 0.182976
0.210921
V2 0.172902 0.000592 0.151777 0.015745 -0.173336
0.070419
V3 -0.032930 -0.026072 -0.115695 -0.020443 0.106629
0.131632
V4 0.021456 -0.018264 -0.007283 0.154458 -0.025870 -
0.092189
V5 -0.093423 -0.142346 -0.114951 0.044032 0.196096
0.145784
V6 0.317014 -0.075999 -0.109415 -0.071080 -0.167733 -
0.018021
V7 -0.087151 -0.067926 0.034442 0.014072 0.237781
0.165772
V8 -0.429228 0.078185 0.236750 0.050419 0.288612 -
0.017866
V9 -0.051086 -0.013090 -0.021308 -0.133493 0.147273
0.139660
V10 -0.056170 -0.042476 0.007749 -0.043234 0.168796
0.141032
V11 -0.021082 -0.064055 0.048498 0.177652 0.158104
0.026327
V12 0.007135 0.005323 0.009348 -0.136648 -0.011180
0.005993
V13 -0.113288 0.052171 -0.002512 0.062089 0.051641 -
0.121680
V14 0.014327 0.104930 -0.110871 -0.196400 -0.189429 -
0.121756
V15 -0.057945 0.053512 -0.011405 0.050610 0.171116
0.090895
V16 -0.004243 -0.085562 0.037555 -0.080057 -0.012745
0.011667
V17 0.011588 -0.099562 0.014490 -0.065053 0.012031
0.053926
V18 0.015661 -0.117250 0.045414 -0.030568 0.063627
0.101011
V19 -0.009816 0.148637 -0.148235 0.041973 0.050118 -
0.051332
V20 0.151163 -0.012711 0.046577 0.004348 -0.140048
0.052518
V21 0.129405 -0.040254 0.130161 0.034475 0.364754
0.291603
V22 0.006539 0.052552 -0.239406 0.026811 -0.377779 -
0.233833
V23 1.000000 -0.039209 0.062201 0.029965 -0.206001
0.121729
V24 -0.039209 1.000000 -0.066769 -0.131888 -0.162779 -
0.076889
V25 0.062201 -0.066769 1.000000 0.072843 0.201343
0.184463
V26 0.029965 -0.131888 0.072843 1.000000 0.175084
0.036473
V27 -0.206001 -0.162779 0.201343 0.175084 1.000000
0.247067
V28 0.121729 -0.076889 0.184463 0.036473 0.247067
1.000000
Class -0.013552 -0.035789 0.067267 0.082742 0.062988
0.079099
Time_Scaled 0.069392 -0.007298 -0.225031 -0.052776 -0.127489
0.028472
Amount_Scaled -0.164602 0.066496 -0.089644 -0.061133 0.096729 -
0.099605

Class Time_Scaled Amount_Scaled

V1 -0.429169 0.233966 -0.025660
V2 0.480641 -0.205670 -0.188290
V3 -0.572221 0.141509 -0.015966
V4 0.712172 -0.182244 0.012432
V5 -0.379339 0.267611 -0.117656
V6 -0.415159 0.093709 0.143857
V7 -0.475549 0.206273 0.104009
V8 0.051612 -0.139637 0.030666
V9 -0.562707 0.148406 0.012622
V10 -0.629284 0.200110 -0.015371
V11 0.689070 -0.281509 -0.005058
V12 -0.678469 0.250983 0.001248
V13 -0.053693 -0.100751 0.020597
V14 -0.749321 0.157581 0.007144
V15 -0.046987 -0.134271 0.068454
V16 -0.598659 0.225081 -0.047402
V17 -0.558772 0.227698 -0.044001
V18 -0.466473 0.254300 -0.035314
V19 0.264262 -0.103809 0.098815
V20 0.183132 -0.033162 0.085702
V21 0.127336 -0.043639 0.023724
V22 0.021711 0.121564 -0.019122
V23 -0.013552 0.069392 -0.164602
V24 -0.035789 -0.007298 0.066496
V25 0.067267 -0.225031 -0.089644
V26 0.082742 -0.052776 -0.061133
V27 0.062988 -0.127489 0.096729
V28 0.079099 0.028472 -0.099605
Class 1.000000 -0.118312 0.123575
Time_Scaled -0.118312 1.000000 0.025937
Amount_Scaled 0.123575 0.025937 1.000000

[31 rows x 31 columns]

Difficult to understand by just look into huge matrix so we draw a heatmap for better
understanding using visualization

plt.figure(figsize=(30,10))
sns.heatmap(corr,annot=True,cmap='BrBG')
plt.show()
From above heatmap we can conclude that V10,V12,V14,V16 features are highly correlated with
class in negative terms and in positive terms features V2,V4,V11,V19 are strongly correlated with
class than any other feature tell check how actually they affect Fraud and Non-Fraud
Classification.

fig,ax=plt.subplots(2,4,figsize=(30,10))
ax=ax.ravel()
sns.boxplot(x='Class',y='V16',data=new_df,ax=ax[0])
ax[0].set_title('Class vs V16 "Negative Correlation"')

sns.boxplot(x='Class',y='V14',data=new_df,ax=ax[1])
ax[1].set_title('Class vs V14 "Negative Correlation"')

sns.boxplot(x='Class',y='V12',data=new_df,ax=ax[2])
ax[2].set_title('Class vs V12 "Negative Correlation"')

sns.boxplot(x='Class',y='V10',data=new_df,ax=ax[3])
ax[3].set_title('Class vs V10 "Negative Correlation"')

sns.kdeplot(new_df['V16'].loc[new_df['Class']==1],color='#f83e07',ax=a
x[4])
ax[4].set_title('V16 Fraud Case Distribution')

sns.kdeplot(new_df['V14'].loc[new_df['Class']==1],color='#f8a407',ax=a
x[5])
ax[5].set_title('V14 Fraud Case Distribution')

sns.kdeplot(new_df['V12'].loc[new_df['Class']==1],color='#0762f8',ax=a
x[6])
ax[6].set_title('V12 Fraud Case Distribution')

sns.kdeplot(new_df['V10'].loc[new_df['Class']==1],color='#a007f8',ax=a
x[7])
ax[7].set_title('V10 Fraud Case Distribution')

plt.show()
In terms of negative correlation the distribution of V14 is only normal distributed as compare to
other negitively correlated features like V12,V16 and in V10 there is significant more number of
outliers in Fraud case as compare to others and if we talk about Non-Fraud case we find that
there are approximetly equal number of outliers.

fig,ax=plt.subplots(2,4,figsize=(30,10))

ax=ax.ravel()

sns.boxplot(x='Class',y='V2',data=new_df,ax=ax[0])
ax[0].set_title('Class vs V2 "Positive Correlation"')

sns.boxplot(x='Class',y='V4',data=new_df,ax=ax[1])
ax[1].set_title('Class vs V4 "Positive Correlation"')

sns.boxplot(x='Class',y='V11',data=new_df,ax=ax[2])
ax[2].set_title('Class vs V11 "Positive Correlation"')

sns.boxplot(x='Class',y='V19',data=new_df,ax=ax[3])
ax[3].set_title('Class vs V19 "Positive Correlation"')

sns.kdeplot(new_df['V2'].loc[new_df['Class']==1],color='#f83e07',ax=ax
[4])
ax[4].set_title('V2 Fraud Case Distribution')

sns.kdeplot(new_df['V4'].loc[new_df['Class']==1],color='#f8a407',ax=ax
[5])
ax[5].set_title('V4 Fraud Case Distribution')

sns.kdeplot(new_df['V11'].loc[new_df['Class']==1],color='#0762f8',ax=a
x[6])
ax[6].set_title('V11 Fraud Case Distribution')

sns.kdeplot(new_df['V19'].loc[new_df['Class']==1],color='#a007f8',ax=a
x[7])
ax[7].set_title('V19 Fraud Case Distribution')

plt.show()
only feature V19 has normal distribution of fraud cases as compare to rest of positively
correlated features.

# Let's perform IQR method to catch outliers and drop them from
dataframe and then split the data for train test split
print("....... For V10...... ")
v10_fraud=new_df['V10'].loc[new_df['Class']==1].values
Q1_v10,Q3_v10=np.percentile(v10_fraud,25),np.percentile(v10_fraud,75)
print(f'25 Percentile: {round(Q1_v10,4)}')
print(f'75 Percentile: {round(Q3_v10,4)}')

IQR_v10=Q3_v10-Q1_v10
print()
print("IQR",IQR_v10)
v10_cutoff=IQR_v10*1.5
print('Cutoff Value',IQR_v10*1.5)
lower_bound_v10,upper_bound_v10=Q1_v10-v10_cutoff,Q3_v10+v10_cutoff
print()
print('Lower Bound ',lower_bound_v10)
print('Upper Bound',upper_bound_v10)

outliers=[x for x in v10_fraud if x < lower_bound_v10 or x >

upper_bound_v10]

print("Outliers of V10",outliers)
print('Number of outliers:',len(outliers))
new_df = new_df.drop(new_df[(new_df['V10'] > upper_bound_v10) |
(new_df['V10'] < lower_bound_v10)].index)
print('Number of Instances after outliers removal:
{}'.format(len(new_df)))

print()
print("....... For V12...... ")
v12_fraud=new_df['V12'].loc[new_df['Class']==1].values
Q1_v12,Q3_v12=np.percentile(v12_fraud,25),np.percentile(v12_fraud,75)
print(f'25 Percentile: {round(Q1_v12,4)}')
print(f'75 Percentile: {round(Q3_v12,4)}')

IQR_v12=Q3_v12-Q1_v12
print()
print("IQR",IQR_v12)
v12_cutoff=IQR_v12*1.5
print('Cutoff Value',IQR_v12*1.5)
lower_bound_v12,upper_bound_v12=Q1_v12-v12_cutoff,Q3_v12+v12_cutoff
print()
print('Lower Bound ',lower_bound_v12)
print('Upper Bound',upper_bound_v12)

outliers=[x for x in v12_fraud if x < lower_bound_v12 or x >

upper_bound_v12]

print("Outliers of V12",outliers)
print('Number of outliers:',len(outliers))
new_df = new_df.drop(new_df[(new_df['V12'] > upper_bound_v12) |
(new_df['V12'] < lower_bound_v12)].index)
print('Number of Instances after outliers removal:
{}'.format(len(new_df)))

print()
print("....... For V14...... ")
v14_fraud=new_df['V14'].loc[new_df['Class']==1].values
Q1_v14,Q3_v14=np.percentile(v14_fraud,25),np.percentile(v14_fraud,75)
print(f'25 Percentile: {round(Q1_v14,4)}')
print(f'75 Percentile: {round(Q3_v14,4)}')

IQR_v14=Q3_v14-Q1_v14
print()
print("IQR",IQR_v14)
v14_cutoff=IQR_v14*1.5
print('Cutoff Value',IQR_v14*1.5)
lower_bound_v14,upper_bound_v14=Q1_v14-v14_cutoff,Q3_v14+v14_cutoff
print()
print('Lower Bound ',lower_bound_v14)
print('Upper Bound',upper_bound_v14)

outliers=[x for x in v14_fraud if x < lower_bound_v14 or x >

upper_bound_v14]

print("Outliers of V12",outliers)
print('Number of outliers:',len(outliers))
new_df = new_df.drop(new_df[(new_df['V14'] > upper_bound_v14) |
(new_df['V14'] < lower_bound_v14)].index)
print('Number of Instances after outliers removal:
{}'.format(len(new_df)))
....... For V10......
25 Percentile: -7.7567
75 Percentile: -2.6142

IQR 5.142514314657911
Cutoff Value 7.713771471986866

Lower Bound -15.47046969983434

Upper Bound 5.099587558797303
Outliers of V10 [-16.3035376590131, -23.2282548357516, -
20.9491915543611, -24.4031849699728, -18.9132433348732, -
22.1870885620007, -17.1415136412892, -16.6011969664137, -
22.1870885620007, -15.5637913387301, -15.5637913387301, -
24.5882624372475, -18.2711681738888, -16.2556117491401, -
22.1870885620007, -16.7460441053944, -16.6496281595399, -
19.836148851696, -22.1870885620007]
Number of outliers: 19
Number of Instances after outliers removal: 963

....... For V12......

25 Percentile: -8.464
75 Percentile: -2.8249

IQR 5.63902050475877
Cutoff Value 8.458530757138156

Lower Bound -16.922496886663865

Upper Bound 5.633585132371216
Outliers of V12 [-17.003289445516, -18.4311310279993, -
17.7691434633638, -18.6837146333443, -17.1829184301947, -
17.2286622386187, -17.1829184301947, -17.1504052507291, -
17.003289445516, -18.0475965708216, -17.1313009454468, -
18.5536970096458, -17.6316063138707]
Number of outliers: 13
Number of Instances after outliers removal: 950

....... For V14......

25 Percentile: -9.4142
75 Percentile: -4.2889

IQR 5.125221763769894
Cutoff Value 7.687832645654842

Lower Bound -17.101993690511843

Upper Bound 3.3988933645677344
Outliers of V12 [-17.4759212828566, -17.7216383537133, -
17.6206343516773, 3.44242199594215, -17.230202160711]
Number of outliers: 5
Number of Instances after outliers removal: 944
fig,ax=plt.subplots(1,3,figsize=(25,6))
sns.boxplot(x='Class',y='V14',data=new_df,ax=ax[0])
ax[0].set_title('Class vs V14 "Negative Correlation"')

sns.boxplot(x='Class',y='V12',data=new_df,ax=ax[1])
ax[1].set_title('Class vs V12 "Negative Correlation"')

sns.boxplot(x='Class',y='V10',data=new_df,ax=ax[2])
ax[2].set_title('Class vs V10 "Negative Correlation"')
plt.show()

Remaining outliers are extreme outliers if we try to remove them also we will loose
identification of feature

X_undersample=new_df.drop(columns='Class')
y_undersample=new_df['Class']

from sklearn.model_selection import train_test_split

x_train_us,x_test_us,y_train_us,y_test_us=train_test_split(X_undersamp
le,y_undersample,test_size=0.2,random_state=42)

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score,
roc_auc_score, accuracy_score, classification_report
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score

print('--------------------Logistic Regression-------------------')
lr=LogisticRegression()
lr.fit(x_train_us,y_train_us)
y_pred_lr=lr.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_pred_lr,y_test_us))
print(classification_report(y_test_us,y_pred_lr))
print()
print('---------------------------SVM------------------------------')
svc=SVC()
svc.fit(x_train_us,y_train_us)
y_pred_svc=svc.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_pred_svc,y_test_us))
print(classification_report(y_test_us,y_pred_svc))

print()
print('--------------------------Knn------------------------------')
knn=KNeighborsClassifier()
knn.fit(x_train_us,y_train_us)
y_pred_knn=knn.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_pred_knn,y_test_us))
print(classification_report(y_test_us,y_pred_knn))

print()
print('----------------------GaussianNB----------------------')
gnb=GaussianNB()
gnb.fit(x_train_us,y_train_us)
y_pred_gnb=gnb.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_pred_gnb,y_test_us))
print(classification_report(y_test_us,y_pred_gnb))

print()
print('------------------RandomForestClassifier----------------')
rfc=RandomForestClassifier()
rfc.fit(x_train_us,y_train_us)
y_pred_rfc=rfc.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_pred_rfc,y_test_us))
print(classification_report(y_test_us,y_pred_rfc))

--------------------Logistic Regression-------------------
ROC AUC Score 0.9595238095238096
precision recall f1-score support

0 0.94 0.98 0.96 101

1 0.98 0.93 0.95 88

accuracy 0.96 189

macro avg 0.96 0.96 0.96 189
weighted avg 0.96 0.96 0.96 189

---------------------------SVM------------------------------
ROC AUC Score 0.950421700478687
precision recall f1-score support

0 0.93 0.98 0.95 101

1 0.98 0.91 0.94 88

accuracy 0.95 189

macro avg 0.95 0.94 0.95 189
weighted avg 0.95 0.95 0.95 189

--------------------------Knn------------------------------
ROC AUC Score 0.9459876543209875
precision recall f1-score support

0 0.92 0.98 0.95 101

1 0.98 0.90 0.93 88

accuracy 0.94 189

macro avg 0.95 0.94 0.94 189
weighted avg 0.94 0.94 0.94 189

----------------------GaussianNB----------------------
ROC AUC Score 0.9288807841349441
precision recall f1-score support

0 0.91 0.96 0.93 101

1 0.95 0.89 0.92 88

accuracy 0.93 189

macro avg 0.93 0.92 0.93 189
weighted avg 0.93 0.93 0.93 189

------------------RandomForestClassifier----------------
ROC AUC Score 0.9641968325791855
precision recall f1-score support

0 0.95 0.98 0.97 101

1 0.98 0.94 0.96 88

accuracy 0.96 189

macro avg 0.96 0.96 0.96 189
weighted avg 0.96 0.96 0.96 189

If you use cross validation score in above code then you can get better ROC AUC score than
above ROC AUC score because you do train and test more times and it make instances more
visible to model.

fig, ax = plt.subplots(1, 5, figsize=(40, 8))

# Logistic Regressor
ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_lr,
ax=ax[0])
ax[0].set_title('Confusion Matrix of Logistic Regressor')

# Support Vector Classifier

ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_svc,
ax=ax[1])
ax[1].set_title('Confusion Matrix of SVC')

# K-Nearest Neighbors
ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_knn,
ax=ax[2])
ax[2].set_title('Confusion Matrix of KNN')

# Gaussian Naive Bayes

ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_gnb,
ax=ax[3])
ax[3].set_title('Confusion Matrix of GaussianNB')

# Random Forest Classifier

ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_rfc,
ax=ax[4])

ax[4].set_title('Confusion Matrix of Random Forest')

plt.tight_layout()
plt.show()

from sklearn.model_selection import GridSearchCV

# Logistic Regression
log_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10,
100, 1000]}
grid_log_reg = GridSearchCV(LogisticRegression(), log_params,cv=5)
grid_log_reg.fit(x_train_us, y_train_us)
# We automatically get the logistic regression with the best
parameters.
log_reg = grid_log_reg.best_estimator_
log_reg

LogisticRegression(C=0.1)
tree_params = {"criterion": ["gini", "entropy"], "max_depth":
[4,6,2,None]}
grid_tree = GridSearchCV(RandomForestClassifier(),
tree_params,n_jobs=-1,cv=5)
grid_tree.fit(x_train_us, y_train_us)
grid_rfc = grid_tree.best_estimator_
print(grid_rfc)

RandomForestClassifier(criterion='entropy')

plt.figure(figsize=(10,6))
from sklearn.metrics import roc_curve, auc
y_score_lr = log_reg.predict_proba(x_test_us) # Logistic Regression
with hypertunning
y_score_svc = svc.decision_function(x_test_us) # Support Vector
Classifier
y_score_knn = knn.predict_proba(x_test_us) # K-Nearest Neighbors
y_score_gnb = gnb.predict_proba(x_test_us) # Gaussian Naive Bayes
y_score_rfc = grid_rfc.predict_proba(x_test_us) # Random Forest with
hypertunning

# Logistic Regressor
fpr_lr, tpr_lr, _ = roc_curve(y_test_us, y_score_lr[:, 1])
roc_auc_lr = auc(fpr_lr, tpr_lr)
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regressor (AUC =
{roc_auc_lr:.2f})')

# Support Vector Classifier

fpr_svc, tpr_svc, _ = roc_curve(y_test_us, y_score_svc)
roc_auc_svc = auc(fpr_svc, tpr_svc)
plt.plot(fpr_svc, tpr_svc, label=f'SVC (AUC = {roc_auc_svc:.2f})')

# K-Nearest Neighbors
fpr_knn, tpr_knn, _ = roc_curve(y_test_us, y_score_knn[:, 1])
roc_auc_knn = auc(fpr_knn, tpr_knn)
plt.plot(fpr_knn, tpr_knn, label=f'KNN (AUC = {roc_auc_knn:.2f})')

# Gaussian Naive Bayes

fpr_gnb, tpr_gnb, _ = roc_curve(y_test_us, y_score_gnb[:, 1])
roc_auc_gnb = auc(fpr_gnb, tpr_gnb)
plt.plot(fpr_gnb, tpr_gnb, label=f'GaussianNB (AUC =
{roc_auc_gnb:.2f})')

# Random Forest Classifier

fpr_rfc, tpr_rfc, _ = roc_curve(y_test_us, y_score_rfc[:, 1])
roc_auc_rfc = auc(fpr_rfc, tpr_rfc)
plt.plot(fpr_rfc, tpr_rfc, label=f'Random Forest (AUC =
{roc_auc_rfc:.2f})')
# Plot settings
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Chance')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')
plt.title('ROC Curves for Various Models')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()

from sklearn.model_selection import learning_curve

from sklearn.model_selection import ShuffleSplit
cv=ShuffleSplit(n_splits=70,test_size=0.2,random_state=42)
fig,ax=plt.subplots(2,3,figsize=(25,10))
ax=ax.ravel()

# for logistic regression

train_sizes1, train_scores1,
validation_scores1=learning_curve(log_reg,X_undersample,y_undersample,
cv=cv)
train_score_mean1=train_scores1.mean(axis=1)
validation_score_mean1=validation_scores1.mean(axis=1)
ax[0].plot(train_sizes1, train_score_mean1, label = 'Training error')
ax[0].plot(train_sizes1, validation_score_mean1, label = 'Validation
error')
ax[0].set_xlabel('Training set size')
ax[0].set_ylabel('Score')
ax[0].set_title('Learning curves for a logistic regression model')
ax[0].legend()
plt.tight_layout()

# for svc
train_sizes2, train_scores2,
validation_scores2=learning_curve(svc,X_undersample,y_undersample,cv=c
v)
train_score_mean2=train_scores2.mean(axis=1)
validation_score_mean2=validation_scores2.mean(axis=1)

ax[1].plot(train_sizes2, train_score_mean2, label = 'Training error')

ax[1].plot(train_sizes2, validation_score_mean2, label = 'Validation
error')
ax[1].set_xlabel('Training set size')
ax[1].set_ylabel('Score')
ax[1].set_title('Learning curves for a svc model')
ax[1].legend()
plt.tight_layout()

#for KneighbourClassifier
train_sizes3, train_scores3,
validation_scores3=learning_curve(knn,X_undersample,y_undersample,cv=c
v)
train_score_mean3=train_scores3.mean(axis=1)
validation_score_mean3=validation_scores3.mean(axis=1)
ax[2].plot(train_sizes3, train_score_mean3, label = 'Training error')
ax[2].plot(train_sizes3, validation_score_mean3, label = 'Validation
error')
ax[2].set_xlabel('Training set size')
ax[2].set_ylabel('Score')
ax[2].set_title('Learning curves for a KneighboursClassifier model')
ax[2].legend()
plt.tight_layout()

#For GuassainNB
train_sizes4, train_scores4,
validation_scores4=learning_curve(gnb,X_undersample,y_undersample,cv=c
v)
train_score_mean4=train_scores4.mean(axis=1)
validation_score_mean4=validation_scores4.mean(axis=1)
ax[3].plot(train_sizes4, train_score_mean4, label = 'Training error')
ax[3].plot(train_sizes4, validation_score_mean4, label = 'Validation
error')
ax[3].set_xlabel('Training set size')
ax[3].set_ylabel('Score')
ax[3].set_title('Learning curves for a GuassainNB model')
ax[3].legend()
plt.tight_layout()

#For RandomForestRegressor
train_sizes5, train_scores5,
validation_scores5=learning_curve(grid_rfc,X_undersample,y_undersample
,cv=cv)
train_score_mean5=train_scores5.mean(axis=1)
validation_score_mean5=validation_scores5.mean(axis=1)
ax[4].plot(train_sizes5, train_score_mean5, label = 'Training error')
ax[4].plot(train_sizes5, validation_score_mean5, label = 'Validation
error')
ax[4].set_xlabel('Training set size')
ax[4].set_ylabel('Score')
ax[4].set_title('Learning curves for a RandomForestRegressor model')
ax[4].legend()
plt.tight_layout()

plt.show()

from sklearn.metrics import precision_recall_curve, auc

plt.figure(figsize=(10,5))
# Logistic Regressor
precision_lr, recall_lr, thresholds_lr =
precision_recall_curve(y_test_us, y_pred_lr)
pr_auc_lr = auc(recall_lr, precision_lr)
plt.plot(recall_lr, precision_lr, label=f'PR Curve LR (AUC =
{pr_auc_lr:.2f})')

# Support Vector Classifier

precision_svc, recall_svc, thresholds_svc =
precision_recall_curve(y_test_us, y_pred_svc)
pr_auc_svc = auc(recall_svc, precision_svc)
plt.plot(recall_svc, precision_svc, label=f'PR Curve SVC (AUC =
{pr_auc_svc:.2f})')

# K-Nearest Neighbors
precision_knn, recall_knn, thresholds_knn =
precision_recall_curve(y_test_us, y_pred_knn)
pr_auc_knn = auc(recall_knn, precision_knn)
plt.plot(recall_knn, precision_knn, label=f'PR Curve KNN (AUC =
{pr_auc_knn:.2f})')

# Gaussian Naive Bayes

precision_gnb, recall_gnb, thresholds_gnb =
precision_recall_curve(y_test_us, y_pred_gnb)
pr_auc_gnb = auc(recall_gnb, precision_gnb)
plt.plot(recall_gnb, precision_gnb, label=f'PR Curve GNB (AUC =
{pr_auc_gnb:.2f})')

# Random Forest Classifier

precision_rfc, recall_rfc, thresholds_rfc =
precision_recall_curve(y_test_us, y_pred_rfc)
pr_auc_rfc = auc(recall_rfc, precision_rfc)
plt.plot(recall_rfc, precision_rfc, label=f'PR Curve RFC (AUC =
{pr_auc_rfc:.2f})')

plt.title('Precision-Recall Curve of Different Classifier ')

plt.legend()
plt.grid()
plt.show()
Your PR curve score can vary when you run this code because i do sample operation on dataset
which make it random and all results are which are related to this dataframe also can vary due to
same reason.

log_reg.fit(x_train,y_train)

LogisticRegression(C=0.1)

grid_rfc.fit(x_train,y_train)

RandomForestClassifier(criterion='entropy')

y_pred_original_lr=log_reg.predict(x_test)
y_pred_original_rfc=grid_rfc.predict(x_test)

#svc.fit(x_train,y_train)

#y_pred_original_svc=svc.predict(x_test)

fig,ax=plt.subplots(1,3,figsize=(20,5))
ConfusionMatrixDisplay.from_predictions(y_test,
y_pred_original_lr,ax=ax[0])
ConfusionMatrixDisplay.from_predictions(y_test,
y_pred_original_rfc,ax=ax[1])
#ConfusionMatrixDisplay.from_predictions(y_test,
y_pred_original_svc,ax=ax[2])
plt.show()
This time RANDOMFORESTCLASSIFER work better than LOGISTICREGRESSOR

OVERSAMPLING USING SMOTE

# Install the imblearn_

!pip install imbalanced-learn scikit-learn

Requirement already satisfied: imbalanced-learn in

/usr/local/lib/python3.10/dist-packages (0.12.3)
Requirement already satisfied: scikit-learn in
/usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in
/usr/local/lib/python3.10/dist-packages (from imbalanced-learn)
(1.26.4)
Requirement already satisfied: scipy>=1.5.0 in
/usr/local/lib/python3.10/dist-packages (from imbalanced-learn)
(1.13.1)
Requirement already satisfied: joblib>=1.1.1 in
/usr/local/lib/python3.10/dist-packages (from imbalanced-learn)
(1.4.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
/usr/local/lib/python3.10/dist-packages (from imbalanced-learn)
(3.5.0)

from imblearn.over_sampling import SMOTE

print(f'Original Train Size:{len(x_train)} ')

print(f'Original Test Size:{len(x_test)} ')

Original Train Size:227846

Original Test Size:56961

SMOTE generates synthetic samples only for the minority class to balance it. These synthetic
samples are artificially created based on the training data. If SMOTE is applied before splitting,
the synthetic samples might "leak" into the test set. The model could then learn these synthetic
patterns and perform unrealistically well during testing.

sm=SMOTE(random_state=42)

x_train_resampled,y_train_resampled=sm.fit_resample(x_train,y_train)
from sklearn.model_selection import RandomizedSearchCV

log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1,

10, 100, 1000]}
rand_log_reg_sm = RandomizedSearchCV(LogisticRegression(),
log_reg_params)
rand_log_reg_sm.fit(x_train_resampled,y_train_resampled)
log_reg_sm = rand_log_reg_sm.best_estimator_
log_reg_sm

LogisticRegression(C=0.1)

logi_sm=LogisticRegression(C=0.001) #please check C value from above

code then replace the value in this code
logi_sm.fit(x_train_resampled,y_train_resampled)

LogisticRegression(C=0.001)

y_pred_sm_us=logi_sm.predict_proba(x_test_us)

y_pred_sm_us1=logi_sm.predict(x_test_us)

print("ROC AUC Score",roc_auc_score(y_pred_sm_us1,y_test_us))

print('Recall',recall_score(y_pred_sm_us1,y_test_us))
print('precision',precision_score(y_pred_sm_us1,y_test_us))

print(classification_report(y_test_us,y_pred_sm_us1))
print()

ROC AUC Score 0.9719626168224299

Recall 1.0
precision 0.9318181818181818
precision recall f1-score support

0 0.94 1.00 0.97 101

1 1.00 0.93 0.96 88

accuracy 0.97 189

macro avg 0.97 0.97 0.97 189
weighted avg 0.97 0.97 0.97 189

ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_sm_us1)

<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at
0x79070eddd5d0>
precision_lr, recall_lr, thresholds_lr =
precision_recall_curve(y_test_us, y_pred_sm_us1)
pr_auc_sm_us = auc(recall_lr, precision_lr)
plt.plot(recall_lr, precision_lr, label=f'PR Curve LR (AUC =
{pr_auc_sm_us:.2f})')
plt.legend(loc='best')
plt.show()
y_score_sm_us = log_reg.predict_proba(x_test_us)
fpr_lr, tpr_lr, _ = roc_curve(y_test_us, y_score_sm_us[:, 1])
roc_auc_sm_us = auc(fpr_lr, tpr_lr)
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regressor (AUC =
{roc_auc_sm_us:.2f})')
plt.legend(loc='best')
plt.show()
pred=logi_sm.predict(x_test)

ConfusionMatrixDisplay.from_predictions(y_test,pred)
plt.show()
rfc_sm=RandomForestClassifier(criterion='entropy')

rfc_sm.fit(x_train_resampled,y_train_resampled)

RandomForestClassifier(criterion='entropy')

rfc_pred_sm=rfc_sm.predict(x_test)

ConfusionMatrixDisplay.from_predictions(y_test,rfc_pred_sm)
plt.show()
Final Decision:

In our analysis, we employed both undersampling and oversampling techniques to address class
imbalance. When using undersampling, we observed a higher number of false negatives (FN),
where actual fraud cases were misclassified as non-fraudulent. This is a significant concern as it
compromises fraud detection, affecting both the company and its users. On the other hand,
oversampling showed an improvement in reducing false negatives but introduced more false
positives (FP), where non-fraudulent cases were incorrectly flagged as fraudulent. False
positives, though less critical than false negatives, can harm the company’s reputation by
inconveniencing users, potentially leading to user dissatisfaction and service abandonment.

When we applied Logistic Regression as our model, undersampling resulted in fewer false
positives (only 3 cases), but the false negatives were significantly higher. This is problematic for
fraud detection as undetected fraudulent activities directly impact business operations and user
trust. Conversely, oversampling with Logistic Regression performed worse in terms of false
positives, which, though reduced fraud risk, increased customer dissatisfaction due to
unwarranted interventions.

Using Random Forest Classifier, however, yielded much better results. It outperformed Logistic
Regression in both undersampling and oversampling scenarios. Particularly with oversampling,
the Random Forest Classifier achieved the best performance, demonstrating the lowest number
of false negatives and just one false positive. This balance ensures robust fraud detection while
maintaining customer trust and minimizing inconvenience.
Based on these findings, we conclude that the Random Forest Classifier with oversampling is
the most suitable approach for our fraud detection system. Moving forward, further refinement
and optimization of this model could help achieve even greater accuracy and reliability.

NA To SS en 1997-1 2010 - Singapore National Annex To Eurocode 7
100% (2)
NA To SS en 1997-1 2010 - Singapore National Annex To Eurocode 7
26 pages
Import: Sys - Executable - M Pip Install
No ratings yet
Import: Sys - Executable - M Pip Install
23 pages
Tabel Distribusi Normal Baku
No ratings yet
Tabel Distribusi Normal Baku
1 page
Website Template For MSC by Coursework - ODL MSC Process Safety
No ratings yet
Website Template For MSC by Coursework - ODL MSC Process Safety
19 pages
Credit Card Fraud Detection
No ratings yet
Credit Card Fraud Detection
101 pages
Xtasy
No ratings yet
Xtasy
14 pages
Fraud Transaction Detection - Ipynb - Colab - Rameshkumar
No ratings yet
Fraud Transaction Detection - Ipynb - Colab - Rameshkumar
7 pages
Credit-Card - Notebooks - Preprocessed-Data - Data - Preprocessing - Ipynb at Main Shubhamdongarjal - Credit-Card
No ratings yet
Credit-Card - Notebooks - Preprocessed-Data - Data - Preprocessing - Ipynb at Main Shubhamdongarjal - Credit-Card
15 pages
Credit Card Fraud Detection
No ratings yet
Credit Card Fraud Detection
8 pages
Credit Card Fraud Detection With CNN 99 Accuracy
No ratings yet
Credit Card Fraud Detection With CNN 99 Accuracy
12 pages
Mlcreditcardfraud
No ratings yet
Mlcreditcardfraud
71 pages
EDA and Similarity of Transactions On CreditCardFraudDetection
No ratings yet
EDA and Similarity of Transactions On CreditCardFraudDetection
66 pages
Observation: As We Can See We Have Threwe Types of Datatypes I.E. (Int, Float, Object) That Means We Have Both Categorical and Numerical Data
No ratings yet
Observation: As We Can See We Have Threwe Types of Datatypes I.E. (Int, Float, Object) That Means We Have Both Categorical and Numerical Data
2 pages
Capstone Report: FIRST NAME: Gopalakrishnan LAST NAME: Kalarikovilagam Subramanian M12821535
No ratings yet
Capstone Report: FIRST NAME: Gopalakrishnan LAST NAME: Kalarikovilagam Subramanian M12821535
17 pages
Terro's REA
No ratings yet
Terro's REA
43 pages
Geank
No ratings yet
Geank
17 pages
E21CSEU0770 Lab4
No ratings yet
E21CSEU0770 Lab4
4 pages
HW8 La
No ratings yet
HW8 La
18 pages
Prg7a - Jupyter Notebook
No ratings yet
Prg7a - Jupyter Notebook
12 pages
Cern Electron Mass Prediction 0 9859 R
No ratings yet
Cern Electron Mass Prediction 0 9859 R
53 pages
Credit - Card - Fraud - Detection Using ML - Jupyter Notebook
No ratings yet
Credit - Card - Fraud - Detection Using ML - Jupyter Notebook
12 pages
Alam-Project3 1
No ratings yet
Alam-Project3 1
16 pages
Curentul Electric in Functie de Radacina Patrata A Tensiunii de Franare
No ratings yet
Curentul Electric in Functie de Radacina Patrata A Tensiunii de Franare
5 pages
Analyzing The Ionosphere Using R
No ratings yet
Analyzing The Ionosphere Using R
22 pages
Run Analysis
No ratings yet
Run Analysis
43 pages
Credit Card 1679991215
No ratings yet
Credit Card 1679991215
26 pages
E17ddba2 8f28 4673 85b5 C301bceae633 Ether Fraud Transaction Case Study
No ratings yet
E17ddba2 8f28 4673 85b5 C301bceae633 Ether Fraud Transaction Case Study
219 pages
Test Data
No ratings yet
Test Data
14 pages
Assignment Lab 1
No ratings yet
Assignment Lab 1
3 pages
20mia1032 A Sri Karthik Lab Model Exam Part 2
No ratings yet
20mia1032 A Sri Karthik Lab Model Exam Part 2
10 pages
Pressure Trace
No ratings yet
Pressure Trace
960 pages
Joints Column1 A Column2 Column3 B Column4 Column5
No ratings yet
Joints Column1 A Column2 Column3 B Column4 Column5
3 pages
Machine Learning Lab Manual
No ratings yet
Machine Learning Lab Manual
42 pages
Pandas
No ratings yet
Pandas
4 pages
ModuleAr Merged
No ratings yet
ModuleAr Merged
42 pages
Frequency Response H1 (ACCEL, FUERZA) - P11
No ratings yet
Frequency Response H1 (ACCEL, FUERZA) - P11
16 pages
Merged
No ratings yet
Merged
35 pages
Cátedra de Análisis Numérico: Guia de Laboratorio #6
No ratings yet
Cátedra de Análisis Numérico: Guia de Laboratorio #6
5 pages
Log Lammps 1400
No ratings yet
Log Lammps 1400
8 pages
Ds Pract 5 Data Analytics1 Vedanti
No ratings yet
Ds Pract 5 Data Analytics1 Vedanti
7 pages
Prac3.ipynb (Auto-R) - JupyterLab
No ratings yet
Prac3.ipynb (Auto-R) - JupyterLab
6 pages
CFD Assignment 120me0012 & 120me0018
No ratings yet
CFD Assignment 120me0012 & 120me0018
10 pages
Bank Loan
No ratings yet
Bank Loan
85 pages
MTA Project
No ratings yet
MTA Project
1 page
( ) - ( ) ( 2 4 ( 1 Cos ) ) 2 + 4 ( 1 Cos )
No ratings yet
( ) - ( ) ( 2 4 ( 1 Cos ) ) 2 + 4 ( 1 Cos )
5 pages
Statistical Data Analysis - Ipynb - Colaboratory
No ratings yet
Statistical Data Analysis - Ipynb - Colaboratory
6 pages
Implementing OLS Regression On Boston Housing Secondary Dataset. Also Check The Data For Missing Values and Outliers.
No ratings yet
Implementing OLS Regression On Boston Housing Secondary Dataset. Also Check The Data For Missing Values and Outliers.
26 pages
G 203008076 - 4 - Christhian Quiñonez - Ex1 - 2 A PDF
No ratings yet
G 203008076 - 4 - Christhian Quiñonez - Ex1 - 2 A PDF
20 pages
SFM
No ratings yet
SFM
17 pages
DAR CompleteFile 1
No ratings yet
DAR CompleteFile 1
41 pages
Cpi - SC
No ratings yet
Cpi - SC
24 pages
S11202415 - Lab 3
No ratings yet
S11202415 - Lab 3
10 pages
Output Huongdandoc
No ratings yet
Output Huongdandoc
19 pages
Columns 1 Through 4
No ratings yet
Columns 1 Through 4
4 pages
PTZ-BOX 5.0 1004121535 Daily 20211030150814
No ratings yet
PTZ-BOX 5.0 1004121535 Daily 20211030150814
5 pages
PIVlab
No ratings yet
PIVlab
72 pages
Anum Metodos
No ratings yet
Anum Metodos
7 pages
Granger Causality and VAR Models
No ratings yet
Granger Causality and VAR Models
1 page
DM Project
No ratings yet
DM Project
34 pages
Credit Card Default
No ratings yet
Credit Card Default
5 pages
A926534728 - 28953 - 8 - 2025 - Spark Mllib
No ratings yet
A926534728 - 28953 - 8 - 2025 - Spark Mllib
8 pages
A List of Factorial Math Constants
From Everand
A List of Factorial Math Constants
Archive Classics
No ratings yet
13 Fault Tree Analysis
100% (1)
13 Fault Tree Analysis
13 pages
Adarsh Product Catalog With Resolution
No ratings yet
Adarsh Product Catalog With Resolution
2 pages
Hussain Et Al., 2015
No ratings yet
Hussain Et Al., 2015
11 pages
Detention Volume Estimating Workbook (PDF) - 201404301105510967
No ratings yet
Detention Volume Estimating Workbook (PDF) - 201404301105510967
300 pages
MIcro End-Milling I - Wear and Breakage
No ratings yet
MIcro End-Milling I - Wear and Breakage
18 pages
The Nature and Goals of Anthropology, Sociology and Political Science
No ratings yet
The Nature and Goals of Anthropology, Sociology and Political Science
12 pages
Final Monsoon Report 2015 Punjab
No ratings yet
Final Monsoon Report 2015 Punjab
31 pages
Ielts Listening 2011, Official 2011
No ratings yet
Ielts Listening 2011, Official 2011
8 pages
Scotts DW2 Guide V1.7
No ratings yet
Scotts DW2 Guide V1.7
40 pages
Life Science Book
67% (3)
Life Science Book
448 pages
Vaya Linear MP RGB BCP424 50 RGB L1210 CE 60 Watt
No ratings yet
Vaya Linear MP RGB BCP424 50 RGB L1210 CE 60 Watt
3 pages
Untapped Mineral Potential of Somaliland Are View
No ratings yet
Untapped Mineral Potential of Somaliland Are View
12 pages
Jawi Unicode Compliant Font PDF
No ratings yet
Jawi Unicode Compliant Font PDF
5 pages
Strength and Durability of Mortar and Concrete Containing Rice Husk Ash: A Review
No ratings yet
Strength and Durability of Mortar and Concrete Containing Rice Husk Ash: A Review
15 pages
Global Supply Chains
No ratings yet
Global Supply Chains
25 pages
F1 Maths Bab 9
No ratings yet
F1 Maths Bab 9
6 pages
2023 English For Computer Science
No ratings yet
2023 English For Computer Science
134 pages
Oxford Insight Mathematics 10-5-25 3 AC For NSW Student Book Obook John Ley Michael Fuller Z Lib Org 60
No ratings yet
Oxford Insight Mathematics 10-5-25 3 AC For NSW Student Book Obook John Ley Michael Fuller Z Lib Org 60
1 page
gooFSM Research Full Chapters
No ratings yet
gooFSM Research Full Chapters
79 pages
Chapter 4 Vector Spaces - Part 2
No ratings yet
Chapter 4 Vector Spaces - Part 2
31 pages
Unit2.5 Compoundsand Solutions
No ratings yet
Unit2.5 Compoundsand Solutions
17 pages
RRB ALP 2024 CBT-1 and CBT-2 Complete Syllabus
No ratings yet
RRB ALP 2024 CBT-1 and CBT-2 Complete Syllabus
5 pages
Safari 8
No ratings yet
Safari 8
8 pages
IAN Akyildiz
No ratings yet
IAN Akyildiz
49 pages
G7 Q1 Icl Worksheet 6
No ratings yet
G7 Q1 Icl Worksheet 6
2 pages
Raslika Sharfina Nirwan: Professional Experience
100% (1)
Raslika Sharfina Nirwan: Professional Experience
1 page
3600-Article Text-14586-3-10-20200713
No ratings yet
3600-Article Text-14586-3-10-20200713
18 pages
Black Holes and Beyond
No ratings yet
Black Holes and Beyond
140 pages