Credit Card Fraud Detection
Libraries installed
# It is defined by the kaggle/python Docker image:
# https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/creditcardfraud/creditcard.csv
df=pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')
cdf=df.copy()
df.sample(10)
            Time        V1        V2        V3        V4        V5        V6
157897  110593.0 -0.184771  1.108228 -0.000420 -0.407888  1.116536 -1.256184
278441  168224.0  1.551207 -0.899886 -3.103947  0.029130  0.882554 -0.751118
178316  123580.0 -0.226187  1.396738 -0.766720  0.839888  1.009486 -0.218086
173828  121657.0  2.061814 -0.002162 -1.047232  0.412224 -0.088962 -1.203159
181882  125160.0  2.411499 -0.945026 -2.179706 -1.663150 -0.141163 -1.013506
249600  154489.0  2.189191 -0.673278 -1.421533 -1.095467 -0.245381 -0.695507
186741  127236.0 -0.037430  1.258773  0.874234  2.963097  1.008588  0.033458
130582   79385.0  1.073372  0.155477 -0.175165  0.711253  1.042201  1.588291
178615  123706.0  1.654301 -0.217762 -2.065709  1.432106  0.516326 -0.495317
131258   79536.0 -1.333894  1.325309  1.691594 -0.018299 -0.444247 -0.446008

[10 rows x 31 columns] (printout truncated to the first seven columns)
The dataset contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable, and it takes value 1 in case of fraud and 0 otherwise.
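Since 'Class' is the response variable, it is worth quantifying the class imbalance up front. A quick check (not in the original extract, but using only the df loaded above):

# Count and percentage of each class (0 = non-fraud, 1 = fraud)
print(df['Class'].value_counts())
print(df['Class'].value_counts(normalize=True)*100)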
df.describe()
                Time            V1            V2            V3            V4
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05
mean    94813.859575  1.168375e-15  3.416908e-16 -1.379537e-15  2.074095e-15
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01

                 V5            V6            V7            V8            V9
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05
mean   9.604066e-16  1.487313e-15 -5.556467e-16  1.213481e-16 -2.406331e-15
std    1.380247e+00  1.332271e+00  1.237094e+00  1.194353e+00  1.098632e+00
min   -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25%   -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50%   -5.433583e-02 -2.741871e-01  4.010308e-02  2.235804e-02 -5.142873e-02
75%    6.119264e-01  3.985649e-01  5.704361e-01  3.273459e-01  5.971390e-01
max    3.480167e+01  7.330163e+01  1.205895e+02  2.000721e+01  1.559499e+01

               Class
count  284807.000000
mean        0.001727
std         0.041527
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000

[8 rows x 31 columns] (the blocks for V10 to V28 and Amount are omitted in this extract)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
df.isnull().sum()
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
fig,ax=plt.subplots(1,2,figsize=(18,5))
sns.histplot(df['Time'],kde=True,color='#0f85d6',ax=ax[0])
ax[0].set_title('Time Distribution')
ax[0].set_xlim(df['Time'].min(),df['Time'].max())
sns.histplot(df['Amount'],kde=True,ax=ax[1],color='#ef6810')
ax[1].set_title("Amount Distibution")
ax[1].set_xlim(df['Amount'].min(),df['Amount'].max())
plt.show()
The Time distribution looks roughly normal within the observed period, but the Amount distribution is strongly right-skewed because of the large gap between the minimum and maximum values of Amount. Let's confirm this with box plots.
fig,ax=plt.subplots(1,2,figsize=(18,5))
sns.boxplot(df[['Time','Amount']],ax=ax[0],palette=['#ef6810','#ef103f'])
ax[0].set_title('Box Plot of Time & Amount')
sns.boxplot(x='Class',y='Amount',data=df,ax=ax[1])
ax[1].set_title('Box Plot Between Class Vs Amount')
plt.show()
The first box plot suggests that feature Time has no outliers while feature Amount has lots of them, and the second box plot suggests that Class 0 has more outliers than Class 1.
We are told that features V1 to V28 are already scaled, so we also scale the Time and Amount features for further analysis using RobustScaler. If you are curious why we select RobustScaler instead of StandardScaler: it uses the median and IQR, which are robust to extreme (outlier) values:
X_scaled = (X - median(X)) / IQR(X)
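To see why the median/IQR formula above resists outliers, here is a toy comparison (illustrative only, not part of the original notebook) of RobustScaler against StandardScaler on a column containing one extreme value:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier
print(RobustScaler().fit_transform(x).ravel())    # bulk of the data keeps its spacing
print(StandardScaler().fit_transform(x).ravel())  # mean and std are dragged by the outlier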
from sklearn.preprocessing import RobustScaler
scaler=RobustScaler()
df['Time_Scaled']=scaler.fit_transform(df['Time'].values.reshape(-1,1))
df['Amount_Scaled']=scaler.fit_transform(df['Amount'].values.reshape(-1,1))
# Drop the Time and Amount features from the dataset, as we have added Time_Scaled and Amount_Scaled to the DataFrame
df.drop(columns=['Time','Amount'],inplace=True,axis=1)
df.head()
         V1        V2        V3        V4        V5        V6        V7
0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599
1  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803
2 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461
3 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609
4 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941

[5 rows x 31 columns] (printout truncated to the first seven columns)
fig,ax=plt.subplots(1,2,figsize=(18,5))
sns.histplot(df['Time_Scaled'],kde=True,color='#0f85d6',ax=ax[0])
ax[0].set_title('Scaled Time Distribution')
ax[0].set_xlim(df['Time_Scaled'].min(),df['Time_Scaled'].max())
sns.histplot(df['Amount_Scaled'],kde=True,ax=ax[1],color='#ef6810')
ax[1].set_title(" Scaled Amount Distibution")
ax[1].set_xlim(df['Amount_Scaled'].min(),df['Amount_Scaled'].max())
plt.show()
The distributions remain the same because RobustScaler does not change the shape of normally distributed data: the Time feature was roughly normal, so Time_Scaled stays roughly normal. RobustScaler also does not remove skewness from the data, but it reduces the impact of outliers, which we can see in the Amount_Scaled distribution.
Our dataset contains 99.827% non-fraudulent and 0.173% fraudulent transactions. For further analysis we want to split it into train and test sets, but an ordinary random train_test_split may not preserve this ratio, so we use StratifiedKFold to split the imbalanced data.
from sklearn.model_selection import StratifiedKFold

X=df.drop(columns=['Class'])
y=df['Class']
skf=StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    print("Train Index:", train_index, "Test Index:", test_index)
    x_train,x_test=X.iloc[train_index],X.iloc[test_index]
    y_train,y_test=y.iloc[train_index],y.iloc[test_index]
#y_train,y_test
Train Index: [ 30473  30496  31002 ... 284804 284805 284806] Test Index: [     0      1      2 ...  57017  57018  57019]
Train Index: [     0      1      2 ... 284804 284805 284806] Test Index: [ 30473  30496  31002 ... 113964 113965 113966]
Train Index: [     0      1      2 ... 284804 284805 284806] Test Index: [ 81609  82400  83053 ... 170946 170947 170948]
Train Index: [     0      1      2 ... 284804 284805 284806] Test Index: [150654 150660 150661 ... 227866 227867 227868]
Train Index: [     0      1      2 ... 227866 227867 227868] Test Index: [212516 212644 213092 ... 284804 284805 284806]
# let's confirm what percentage of each category falls in the train and test data
# For y_train
print(f"Non-Fraudulent Transactions in y_train: {round(((y_train.values==0).sum()/y_train.shape[0])*100,3)}%")
# Percentage of fraudulent data in y_train
print(f"Fraudulent Transactions in y_train: {round(((y_train.values==1).sum()/y_train.shape[0])*100,3)}%")
print()
# For y_test
print(f"Non-Fraudulent Transactions in y_test: {round(((y_test.values==0).sum()/y_test.shape[0])*100,3)}%")
# Percentage of fraudulent data in y_test
print(f"Fraudulent Transactions in y_test: {round(((y_test.values==1).sum()/y_test.shape[0])*100,3)}%")
The data split is done, and we can confirm an almost equal percentage of fraudulent data in y_train and y_test, so we don't have to worry about the model failing to learn because the minority class is missing from either split, thanks to StratifiedKFold().
df.shape
(284807, 31)
Random Undersampling
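The cell that builds the balanced frame new_df is not shown in this extract. A minimal sketch of the random-undersampling step, assuming a 1:1 class ratio (the dataset contains 492 fraud cases):

# Shuffle, then pair the 492 fraud rows with 492 sampled non-fraud rows
shuffled = df.sample(frac=1, random_state=42)
fraud_df = shuffled.loc[shuffled['Class'] == 1]
non_fraud_df = shuffled.loc[shuffled['Class'] == 0][:492]
new_df = pd.concat([fraud_df, non_fraud_df]).sample(frac=1, random_state=42)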
new_df.sample(5)
              V1        V2        V3        V4        V5        V6        V7
241445 -3.818214  2.551338 -4.759158  1.636967 -1.167900 -1.678413 -3.144732
235616  0.218810  2.715855 -5.111658  6.310661 -0.848345 -0.882446 -2.902079
197185  1.239276 -2.212808 -2.393390 -0.099816  1.242783  4.061149 -0.709314
189701 -4.599447  2.762540 -4.656530  5.201403 -2.470388 -0.357618 -3.767189
223915  1.795730 -1.502166 -2.715497 -1.538789  1.459905  3.469648 -1.305134

        Amount_Scaled
241445      -0.157898
235616      -0.296793
197185       5.784951
189701       0.996996
223915       2.333543

[5 rows x 31 columns]
corr=new_df.corr()
corr

[31 rows x 31 columns] (the full correlation-matrix printout is omitted here; it is visualized as a heatmap below)
Such a huge matrix is difficult to read directly, so we draw a heatmap for better understanding through visualization.
plt.figure(figsize=(30,10))
sns.heatmap(corr,annot=True,cmap='BrBG')
plt.show()
From the above heatmap we can conclude that features V10, V12, V14, and V16 are the most strongly negatively correlated with Class, and features V2, V4, V11, and V19 are the most strongly positively correlated with Class. Let's check how they actually affect fraud and non-fraud classification.
fig,ax=plt.subplots(2,4,figsize=(30,10))
ax=ax.ravel()
sns.boxplot(x='Class',y='V16',data=new_df,ax=ax[0])
ax[0].set_title('Class vs V16 "Negative Correlation"')
sns.boxplot(x='Class',y='V14',data=new_df,ax=ax[1])
ax[1].set_title('Class vs V14 "Negative Correlation"')
sns.boxplot(x='Class',y='V12',data=new_df,ax=ax[2])
ax[2].set_title('Class vs V12 "Negative Correlation"')
sns.boxplot(x='Class',y='V10',data=new_df,ax=ax[3])
ax[3].set_title('Class vs V10 "Negative Correlation"')
sns.kdeplot(new_df['V16'].loc[new_df['Class']==1],color='#f83e07',ax=ax[4])
ax[4].set_title('V16 Fraud Case Distribution')
sns.kdeplot(new_df['V14'].loc[new_df['Class']==1],color='#f8a407',ax=ax[5])
ax[5].set_title('V14 Fraud Case Distribution')
sns.kdeplot(new_df['V12'].loc[new_df['Class']==1],color='#0762f8',ax=ax[6])
ax[6].set_title('V12 Fraud Case Distribution')
sns.kdeplot(new_df['V10'].loc[new_df['Class']==1],color='#a007f8',ax=ax[7])
ax[7].set_title('V10 Fraud Case Distribution')
plt.show()
Among the negatively correlated features, only V14 has an approximately normal fraud-case distribution compared with V12 and V16, and V10 shows significantly more outliers in the fraud class than the others; in the non-fraud class, the outlier counts are approximately equal.
fig,ax=plt.subplots(2,4,figsize=(30,10))
ax=ax.ravel()
sns.boxplot(x='Class',y='V2',data=new_df,ax=ax[0])
ax[0].set_title('Class vs V2 "Positive Correlation"')
sns.boxplot(x='Class',y='V4',data=new_df,ax=ax[1])
ax[1].set_title('Class vs V4 "Positive Correlation"')
sns.boxplot(x='Class',y='V11',data=new_df,ax=ax[2])
ax[2].set_title('Class vs V11 "Positive Correlation"')
sns.boxplot(x='Class',y='V19',data=new_df,ax=ax[3])
ax[3].set_title('Class vs V19 "Positive Correlation"')
sns.kdeplot(new_df['V2'].loc[new_df['Class']==1],color='#f83e07',ax=ax[4])
ax[4].set_title('V2 Fraud Case Distribution')
sns.kdeplot(new_df['V4'].loc[new_df['Class']==1],color='#f8a407',ax=ax[5])
ax[5].set_title('V4 Fraud Case Distribution')
sns.kdeplot(new_df['V11'].loc[new_df['Class']==1],color='#0762f8',ax=ax[6])
ax[6].set_title('V11 Fraud Case Distribution')
sns.kdeplot(new_df['V19'].loc[new_df['Class']==1],color='#a007f8',ax=ax[7])
ax[7].set_title('V19 Fraud Case Distribution')
plt.show()
Only feature V19 has an approximately normal fraud-case distribution compared with the rest of the positively correlated features.
# Let's perform the IQR method to catch outliers, drop them from the
# dataframe, and then split the data for train/test
print("....... For V10...... ")
v10_fraud=new_df['V10'].loc[new_df['Class']==1].values
Q1_v10,Q3_v10=np.percentile(v10_fraud,25),np.percentile(v10_fraud,75)
print(f'25 Percentile: {round(Q1_v10,4)}')
print(f'75 Percentile: {round(Q3_v10,4)}')
IQR_v10=Q3_v10-Q1_v10
print()
print("IQR",IQR_v10)
v10_cutoff=IQR_v10*1.5
print('Cutoff Value',v10_cutoff)
lower_bound_v10,upper_bound_v10=Q1_v10-v10_cutoff,Q3_v10+v10_cutoff
print()
print('Lower Bound ',lower_bound_v10)
print('Upper Bound',upper_bound_v10)
outliers=[x for x in v10_fraud if x<lower_bound_v10 or x>upper_bound_v10]
print("Outliers of V10",outliers)
print('Number of outliers:',len(outliers))
new_df = new_df.drop(new_df[(new_df['V10'] > upper_bound_v10) | (new_df['V10'] < lower_bound_v10)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
print()
print("....... For V12...... ")
v12_fraud=new_df['V12'].loc[new_df['Class']==1].values
Q1_v12,Q3_v12=np.percentile(v12_fraud,25),np.percentile(v12_fraud,75)
print(f'25 Percentile: {round(Q1_v12,4)}')
print(f'75 Percentile: {round(Q3_v12,4)}')
IQR_v12=Q3_v12-Q1_v12
print()
print("IQR",IQR_v12)
v12_cutoff=IQR_v12*1.5
print('Cutoff Value',v12_cutoff)
lower_bound_v12,upper_bound_v12=Q1_v12-v12_cutoff,Q3_v12+v12_cutoff
print()
print('Lower Bound ',lower_bound_v12)
print('Upper Bound',upper_bound_v12)
outliers=[x for x in v12_fraud if x<lower_bound_v12 or x>upper_bound_v12]
print("Outliers of V12",outliers)
print('Number of outliers:',len(outliers))
new_df = new_df.drop(new_df[(new_df['V12'] > upper_bound_v12) | (new_df['V12'] < lower_bound_v12)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
print()
print("....... For V14...... ")
v14_fraud=new_df['V14'].loc[new_df['Class']==1].values
Q1_v14,Q3_v14=np.percentile(v14_fraud,25),np.percentile(v14_fraud,75)
print(f'25 Percentile: {round(Q1_v14,4)}')
print(f'75 Percentile: {round(Q3_v14,4)}')
IQR_v14=Q3_v14-Q1_v14
print()
print("IQR",IQR_v14)
v14_cutoff=IQR_v14*1.5
print('Cutoff Value',v14_cutoff)
lower_bound_v14,upper_bound_v14=Q1_v14-v14_cutoff,Q3_v14+v14_cutoff
print()
print('Lower Bound ',lower_bound_v14)
print('Upper Bound',upper_bound_v14)
outliers=[x for x in v14_fraud if x<lower_bound_v14 or x>upper_bound_v14]
print("Outliers of V14",outliers)
print('Number of outliers:',len(outliers))
new_df = new_df.drop(new_df[(new_df['V14'] > upper_bound_v14) | (new_df['V14'] < lower_bound_v14)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
....... For V10......
25 Percentile: -7.7567
75 Percentile: -2.6142

IQR 5.142514314657911
Cutoff Value 7.713771471986866

....... For V12......
IQR 5.63902050475877
Cutoff Value 8.458530757138156

....... For V14......
IQR 5.125221763769894
Cutoff Value 7.687832645654842

(remaining printout truncated)
fig,ax=plt.subplots(1,3,figsize=(20,5))
# First panel (V14) reconstructed to match the pattern; the original line is missing from this extract
sns.boxplot(x='Class',y='V14',data=new_df,ax=ax[0])
ax[0].set_title('Class vs V14 "Negative Correlation"')
sns.boxplot(x='Class',y='V12',data=new_df,ax=ax[1])
ax[1].set_title('Class vs V12 "Negative Correlation"')
sns.boxplot(x='Class',y='V10',data=new_df,ax=ax[2])
ax[2].set_title('Class vs V10 "Negative Correlation"')
plt.show()
The remaining outliers are extreme ones; if we tried to remove them as well, we would lose the discriminative signal of these features.
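For reference, a common convention labels points beyond 3.0 x IQR as extreme outliers. A quick count for the V14 fraud cases, as a sketch using that convention (not code from the original notebook):

# Count "extreme" V14 fraud outliers with a 3.0*IQR fence
v14_vals = new_df['V14'].loc[new_df['Class'] == 1].values
q1, q3 = np.percentile(v14_vals, 25), np.percentile(v14_vals, 75)
iqr = q3 - q1
extreme = [x for x in v14_vals if x < q1 - 3.0*iqr or x > q3 + 3.0*iqr]
print('Extreme V14 outliers remaining:', len(extreme))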
X_undersample=new_df.drop(columns='Class')
y_undersample=new_df['Class']
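The cell that produces x_train_us/x_test_us is not shown in this extract; a minimal sketch, assuming an 80/20 stratified split of the undersampled data:

from sklearn.model_selection import train_test_split
x_train_us, x_test_us, y_train_us, y_test_us = train_test_split(
    X_undersample, y_undersample, test_size=0.2, stratify=y_undersample, random_state=42)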
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report

print('--------------------Logistic Regression-------------------')
lr=LogisticRegression()
lr.fit(x_train_us,y_train_us)
y_pred_lr=lr.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_test_us,y_pred_lr))
print(classification_report(y_test_us,y_pred_lr))
print()
print('---------------------------SVM------------------------------')
svc=SVC()
svc.fit(x_train_us,y_train_us)
y_pred_svc=svc.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_test_us,y_pred_svc))
print(classification_report(y_test_us,y_pred_svc))
print()
print('--------------------------Knn------------------------------')
knn=KNeighborsClassifier()
knn.fit(x_train_us,y_train_us)
y_pred_knn=knn.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_test_us,y_pred_knn))
print(classification_report(y_test_us,y_pred_knn))
print()
print('----------------------GaussianNB----------------------')
gnb=GaussianNB()
gnb.fit(x_train_us,y_train_us)
y_pred_gnb=gnb.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_test_us,y_pred_gnb))
print(classification_report(y_test_us,y_pred_gnb))
print()
print('------------------RandomForestClassifier----------------')
rfc=RandomForestClassifier()
rfc.fit(x_train_us,y_train_us)
y_pred_rfc=rfc.predict(x_test_us)
print("ROC AUC Score",roc_auc_score(y_test_us,y_pred_rfc))
print(classification_report(y_test_us,y_pred_rfc))
--------------------Logistic Regression-------------------
ROC AUC Score 0.9595238095238096
(classification report truncated)

---------------------------SVM------------------------------
ROC AUC Score 0.950421700478687
(classification report truncated)

--------------------------Knn------------------------------
ROC AUC Score 0.9459876543209875
(classification report truncated)

----------------------GaussianNB----------------------
ROC AUC Score 0.9288807841349441
(classification report truncated)

------------------RandomForestClassifier----------------
ROC AUC Score 0.9641968325791855
(classification report truncated)
If you use a cross-validated score in the code above, you can get a better and more reliable ROC AUC estimate, because the model is trained and tested several times and every instance becomes visible to it.
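For instance, a cross-validated ROC AUC for the logistic model could be computed like this (a sketch, not the original cell):

from sklearn.model_selection import cross_val_score
# Five-fold cross-validated ROC AUC on the undersampled data
cv_scores = cross_val_score(LogisticRegression(), X_undersample, y_undersample, cv=5, scoring='roc_auc')
print('Mean CV ROC AUC:', cv_scores.mean())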
from sklearn.metrics import ConfusionMatrixDisplay

fig,ax=plt.subplots(1,3,figsize=(18,5))
# Logistic Regressor
ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_lr, ax=ax[0])
ax[0].set_title('Confusion Matrix of Logistic Regressor')
# K-Nearest Neighbors
ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_knn, ax=ax[2])
ax[2].set_title('Confusion Matrix of KNN')
plt.tight_layout()
plt.show()
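The tuning cell that produced the log_reg estimator printed below is not shown in this extract; a plausible sketch, assuming a grid search over the regularization strength C:

from sklearn.model_selection import GridSearchCV
# Hypothetical search space; the original grid is unknown
log_reg_params = {"C": [0.001, 0.01, 0.1, 1, 10]}
grid_log_reg = GridSearchCV(LogisticRegression(), log_reg_params, cv=5)
grid_log_reg.fit(x_train_us, y_train_us)
log_reg = grid_log_reg.best_estimator_
print(log_reg)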
LogisticRegression(C=0.1)
tree_params = {"criterion": ["gini", "entropy"], "max_depth": [4,6,2,None]}
grid_tree = GridSearchCV(RandomForestClassifier(), tree_params,n_jobs=-1,cv=5)
grid_tree.fit(x_train_us, y_train_us)
grid_rfc = grid_tree.best_estimator_
print(grid_rfc)
RandomForestClassifier(criterion='entropy')
plt.figure(figsize=(10,6))
from sklearn.metrics import roc_curve, auc
y_score_lr = log_reg.predict_proba(x_test_us)   # Logistic Regression with hyperparameter tuning
y_score_svc = svc.decision_function(x_test_us)  # Support Vector Classifier
y_score_knn = knn.predict_proba(x_test_us)      # K-Nearest Neighbors
y_score_gnb = gnb.predict_proba(x_test_us)      # Gaussian Naive Bayes
y_score_rfc = grid_rfc.predict_proba(x_test_us) # Random Forest with hyperparameter tuning
# Logistic Regressor
fpr_lr, tpr_lr, _ = roc_curve(y_test_us, y_score_lr[:, 1])
roc_auc_lr = auc(fpr_lr, tpr_lr)
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regressor (AUC = {roc_auc_lr:.2f})')
# K-Nearest Neighbors
fpr_knn, tpr_knn, _ = roc_curve(y_test_us, y_score_knn[:, 1])
roc_auc_knn = auc(fpr_knn, tpr_knn)
plt.plot(fpr_knn, tpr_knn, label=f'KNN (AUC = {roc_auc_knn:.2f})')
# (the SVC, GaussianNB, and Random Forest curves are plotted the same way)
plt.legend(loc='lower right')
plt.show()
# Learning curves on the undersampled data
from sklearn.model_selection import learning_curve
cv = 5  # assumption: the CV splitter defined in the original cell is not shown
fig,ax = plt.subplots(1,5,figsize=(30,5))
# ax[0] is left for the Logistic Regression panel, which is missing from this extract
# for SVC
train_sizes2, train_scores2, validation_scores2 = learning_curve(svc,X_undersample,y_undersample,cv=cv)
train_score_mean2=train_scores2.mean(axis=1)
validation_score_mean2=validation_scores2.mean(axis=1)
ax[1].plot(train_sizes2, train_score_mean2, label='Training score')
ax[1].plot(train_sizes2, validation_score_mean2, label='Validation score')
ax[1].set_xlabel('Training set size')
ax[1].set_ylabel('Score')
ax[1].set_title('Learning curves for an SVC model')
ax[1].legend()
# for KNeighborsClassifier
train_sizes3, train_scores3, validation_scores3 = learning_curve(knn,X_undersample,y_undersample,cv=cv)
train_score_mean3=train_scores3.mean(axis=1)
validation_score_mean3=validation_scores3.mean(axis=1)
ax[2].plot(train_sizes3, train_score_mean3, label='Training score')
ax[2].plot(train_sizes3, validation_score_mean3, label='Validation score')
ax[2].set_xlabel('Training set size')
ax[2].set_ylabel('Score')
ax[2].set_title('Learning curves for a KNeighborsClassifier model')
ax[2].legend()
plt.tight_layout()
# for GaussianNB
train_sizes4, train_scores4, validation_scores4 = learning_curve(gnb,X_undersample,y_undersample,cv=cv)
train_score_mean4=train_scores4.mean(axis=1)
validation_score_mean4=validation_scores4.mean(axis=1)
ax[3].plot(train_sizes4, train_score_mean4, label='Training score')
ax[3].plot(train_sizes4, validation_score_mean4, label='Validation score')
ax[3].set_xlabel('Training set size')
ax[3].set_ylabel('Score')
ax[3].set_title('Learning curves for a GaussianNB model')
ax[3].legend()
plt.tight_layout()
# for the tuned RandomForestClassifier
train_sizes5, train_scores5, validation_scores5 = learning_curve(grid_rfc,X_undersample,y_undersample,cv=cv)
train_score_mean5=train_scores5.mean(axis=1)
validation_score_mean5=validation_scores5.mean(axis=1)
ax[4].plot(train_sizes5, train_score_mean5, label='Training score')
ax[4].plot(train_sizes5, validation_score_mean5, label='Validation score')
ax[4].set_xlabel('Training set size')
ax[4].set_ylabel('Score')
ax[4].set_title('Learning curves for a RandomForestClassifier model')
ax[4].legend()
plt.tight_layout()
plt.show()
# K-Nearest Neighbors
from sklearn.metrics import precision_recall_curve
precision_knn, recall_knn, thresholds_knn = precision_recall_curve(y_test_us, y_pred_knn)
pr_auc_knn = auc(recall_knn, precision_knn)
plt.plot(recall_knn, precision_knn, label=f'PR Curve KNN (AUC = {pr_auc_knn:.2f})')
log_reg.fit(x_train,y_train)
LogisticRegression(C=0.1)
grid_rfc.fit(x_train,y_train)
RandomForestClassifier(criterion='entropy')
y_pred_original_lr=log_reg.predict(x_test)
y_pred_original_rfc=grid_rfc.predict(x_test)
#svc.fit(x_train,y_train)
#y_pred_original_svc=svc.predict(x_test)
fig,ax=plt.subplots(1,3,figsize=(20,5))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_original_lr,ax=ax[0])
ConfusionMatrixDisplay.from_predictions(y_test, y_pred_original_rfc,ax=ax[1])
#ConfusionMatrixDisplay.from_predictions(y_test, y_pred_original_svc,ax=ax[2])
plt.show()
This time the RandomForestClassifier works better than the LogisticRegression model.
SMOTE generates synthetic samples only for the minority class to balance it. These synthetic
samples are artificially created based on the training data. If SMOTE is applied before splitting,
the synthetic samples might "leak" into the test set. The model could then learn these synthetic
patterns and perform unrealistically well during testing.
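To make the ordering concrete, here is a minimal sketch of the correct pipeline (the variable names x_tr/x_te are hypothetical; the next cell applies the same idea to x_train/y_train):

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# Split first, then oversample only the training portion
x_tr, x_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
x_tr_res, y_tr_res = SMOTE(random_state=42).fit_resample(x_tr, y_tr)
# A leaky variant would call SMOTE().fit_resample(X, y) before splitting,
# letting synthetic minority rows land in the test set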
from imblearn.over_sampling import SMOTE
sm=SMOTE(random_state=42)
x_train_resampled,y_train_resampled=sm.fit_resample(x_train,y_train)
from sklearn.model_selection import RandomizedSearchCV
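The randomized-search cell that produced logi_sm is not shown in this extract; a plausible sketch (the parameter grid is assumed), whose two prints would correspond to the outputs below:

# Hypothetical search space; the original is unknown
param_dist = {"C": [0.001, 0.01, 0.1, 1, 10]}
rand_log_reg = RandomizedSearchCV(LogisticRegression(), param_dist, n_iter=5, cv=5, random_state=42)
rand_log_reg.fit(x_train_resampled, y_train_resampled)
logi_sm = rand_log_reg.best_estimator_
print(log_reg)   # best estimator from the undersampling experiments
print(logi_sm)   # best estimator trained on the SMOTE-resampled data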
LogisticRegression(C=0.1)
LogisticRegression(C=0.001)
y_pred_sm_us=logi_sm.predict_proba(x_test_us)
y_pred_sm_us1=logi_sm.predict(x_test_us)
print(classification_report(y_test_us,y_pred_sm_us1))
print()
ConfusionMatrixDisplay.from_predictions(y_test_us, y_pred_sm_us1)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x79070eddd5d0>
precision_lr, recall_lr, thresholds_lr = precision_recall_curve(y_test_us, y_pred_sm_us1)
pr_auc_sm_us = auc(recall_lr, precision_lr)
plt.plot(recall_lr, precision_lr, label=f'PR Curve LR (AUC = {pr_auc_sm_us:.2f})')
plt.legend(loc='best')
plt.show()
y_score_sm_us = log_reg.predict_proba(x_test_us)
fpr_lr, tpr_lr, _ = roc_curve(y_test_us, y_score_sm_us[:, 1])
roc_auc_sm_us = auc(fpr_lr, tpr_lr)
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regressor (AUC = {roc_auc_sm_us:.2f})')
plt.legend(loc='best')
plt.show()
pred=logi_sm.predict(x_test)
ConfusionMatrixDisplay.from_predictions(y_test,pred)
plt.show()
rfc_sm=RandomForestClassifier(criterion='entropy')
rfc_sm.fit(x_train_resampled,y_train_resampled)
RandomForestClassifier(criterion='entropy')
rfc_pred_sm=rfc_sm.predict(x_test)
ConfusionMatrixDisplay.from_predictions(y_test,rfc_pred_sm)
plt.show()
Final Decision:
In our analysis, we employed both undersampling and oversampling techniques to address class
imbalance. When using undersampling, we observed a higher number of false negatives (FN),
where actual fraud cases were misclassified as non-fraudulent. This is a significant concern as it
compromises fraud detection, affecting both the company and its users. On the other hand,
oversampling showed an improvement in reducing false negatives but introduced more false
positives (FP), where non-fraudulent cases were incorrectly flagged as fraudulent. False
positives, though less critical than false negatives, can harm the company’s reputation by
inconveniencing users, potentially leading to user dissatisfaction and service abandonment.
When we applied Logistic Regression as our model, undersampling resulted in fewer false
positives (only 3 cases), but the false negatives were significantly higher. This is problematic for
fraud detection as undetected fraudulent activities directly impact business operations and user
trust. Conversely, oversampling with Logistic Regression performed worse in terms of false positives; although it reduced fraud risk, it increased customer dissatisfaction due to unwarranted interventions.
Using Random Forest Classifier, however, yielded much better results. It outperformed Logistic
Regression in both undersampling and oversampling scenarios. Particularly with oversampling,
the Random Forest Classifier achieved the best performance, demonstrating the lowest number
of false negatives and just one false positive. This balance ensures robust fraud detection while
maintaining customer trust and minimizing inconvenience.
Based on these findings, we conclude that the Random Forest Classifier with oversampling is
the most suitable approach for our fraud detection system. Moving forward, further refinement
and optimization of this model could help achieve even greater accuracy and reliability.