Credit Card Fraud Detection
Email: [email protected]
In this notebook I will try to predict fraudulent transactions in a given data set. Because the data is heavily imbalanced, standard
metrics for evaluating classification algorithms (such as accuracy) are misleading. I will focus instead on sensitivity (true positive
rate) and specificity (true negative rate). The two trade off against each other, so we want to find an optimal balance between them.
The right trade-off usually depends on the application, and for fraud detection I would prefer high sensitivity
(i.e. given that a transaction is fraudulent, I want to detect it with high probability).
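For reference, both metrics follow directly from the confusion matrix. A minimal sketch with small hypothetical label arrays (not this data set), where 1 = fraud and 0 = genuine:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical example labels: 1 = fraud, 0 = genuine
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate: P(predict fraud | actual fraud)
specificity = tn / (tn + fp)  # true negative rate: P(predict genuine | actual genuine)
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)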
IMPORTING LIBRARIES:
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
import warnings
warnings.filterwarnings('ignore')
READING DATASET:
In [2]:
data=pd.read_csv('/kaggle/input/creditcardfraud/creditcard.csv')
In [3]:
data.head()
Out[3]:
[data.head(): first five rows of Time, V1–V28 (anonymised PCA components), Amount and Class]
5 rows × 31 columns
NULL VALUES:
In [4]:
data.isnull().sum()
Out[4]:
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
INFORMATION
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
DESCRIPTIVE STATISTICS
In [6]:
data.describe().T.head()
Out[6]:
[data.describe().T.head(): count, mean, std, min, quartiles and max for Time and V1–V4]
In [7]:
data.shape
Out[7]:
(284807, 31)
In [8]:
data.columns
Out[8]:
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
'Class'],
dtype='object')
In [9]:
fraud_cases=len(data[data['Class']==1])
In [10]:
print('Number of Fraud Cases:',fraud_cases)
Number of Fraud Cases: 492
In [11]:
non_fraud_cases=len(data[data['Class']==0])
In [12]:
print('Number of Non Fraud Cases:',non_fraud_cases)
Number of Non Fraud Cases: 284315
In [13]:
fraud=data[data['Class']==1]
In [14]:
genuine=data[data['Class']==0]
In [15]:
fraud.Amount.describe()
Out[15]:
count 492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
In [16]:
genuine.Amount.describe()
Out[16]:
count 284315.000000
mean 88.291022
std 250.105092
min 0.000000
25% 5.650000
50% 22.000000
75% 77.050000
max 25691.160000
Name: Amount, dtype: float64
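The two describe() outputs make the imbalance concrete: 492 fraud cases against 284,315 genuine ones. A minimal sketch quantifying that share, reusing the fraud_cases and non_fraud_cases counts computed above:

# Quantify the class imbalance using the counts computed earlier
fraud_ratio = fraud_cases / (fraud_cases + non_fraud_cases)
print('Fraud share of all transactions: {:.3%}'.format(fraud_ratio))
# On this dataset: 492 / 284807, i.e. about 0.173%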
EDA
In [17]:
data.hist(figsize=(20,20),color='lime')
plt.show()
In [18]:
rcParams['figure.figsize'] = 16, 8
f,(ax1, ax2) = plt.subplots(2, 1, sharex=True)
f.suptitle('Time of transaction vs Amount by class')
ax1.scatter(fraud.Time, fraud.Amount)
ax1.set_title('Fraud')
ax2.scatter(genuine.Time, genuine.Amount)
ax2.set_title('Genuine')
plt.xlabel('Time (in Seconds)')
plt.ylabel('Amount')
plt.show()
CORRELATION
In [19]:
plt.figure(figsize=(10,8))
corr=data.corr()
sns.heatmap(corr,cmap='BuPu')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4c890f89b0>
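The heatmap gives a global view; to see which features relate most to the target, one can sort the Class column of the same correlation matrix. A minimal sketch reusing corr from the cell above:

# Rank features by their (signed) correlation with the fraud label
class_corr = corr['Class'].drop('Class').sort_values()
print(class_corr.head())   # most negatively correlated features
print(class_corr.tail())   # most positively correlated features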
In [20]:
from sklearn.model_selection import train_test_split
Model 1:
In [21]:
X=data.drop(['Class'],axis=1)
In [22]:
y=data['Class']
In [23]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=123)
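With roughly 0.17% positives, a purely random split can leave the test set with a noticeably different fraud rate. A possible variant (not what this notebook ran) passes stratify=y so both splits keep the same class proportions:

# Optional stratified variant, kept under separate names so the
# notebook's original split above is untouched
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.30, random_state=123, stratify=y)
print(ys_train.mean(), ys_test.mean())  # both fractions should be close to 0.00173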
In [24]:
from sklearn.ensemble import RandomForestClassifier
In [25]:
rfc=RandomForestClassifier()
In [26]:
model=rfc.fit(X_train,y_train)
In [27]:
prediction=model.predict(X_test)
In [28]:
from sklearn.metrics import accuracy_score
In [29]:
accuracy_score(y_test,prediction)
Out[29]:
0.9995786664794073
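As noted in the introduction, ~99.96% accuracy is expected even from a trivial "always genuine" classifier on data this imbalanced, so it says little by itself. A short sketch of the metrics that matter here, applied to the random-forest predictions above (the exact numbers will vary between runs):

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, prediction).ravel()
print('Sensitivity (fraud recall):', tp / (tp + fn))
print('Specificity:', tn / (tn + fp))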
Model 2:
In [30]:
from sklearn.linear_model import LogisticRegression
In [31]:
X1=data.drop(['Class'],axis=1)
In [32]:
y1=data['Class']
In [33]:
X1_train,X1_test,y1_train,y1_test=train_test_split(X1,y1,test_size=0.3,random_state=123)
In [34]:
lr=LogisticRegression()
In [35]:
model2=lr.fit(X1_train,y1_train)
In [36]:
prediction2=model2.predict(X1_test)
In [37]:
accuracy_score(y1_test,prediction2)
Out[37]:
0.9988764439450862
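Plain logistic regression tends to favour the majority class on data this skewed. A hypothetical variant, not run in the original notebook, reweights the classes inversely to their frequency via class_weight='balanced', which typically raises sensitivity at some cost in specificity:

# Hypothetical variant: class_weight='balanced' upweights the rare fraud class
lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
model2b = lr_balanced.fit(X1_train, y1_train)
prediction2b = model2b.predict(X1_test)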
Model 3:
In [38]:
from sklearn.tree import DecisionTreeClassifier
In [39]:
X2=data.drop(['Class'],axis=1)
In [40]:
y2=data['Class']
In [41]:
dt=DecisionTreeClassifier()
In [42]:
X2_train,X2_test,y2_train,y2_test=train_test_split(X2,y2,test_size=0.3,random_state=123)
In [43]:
model3=dt.fit(X2_train,y2_train)
In [44]:
prediction3=model3.predict(X2_test)
In [45]:
accuracy_score(y2_test,prediction3)
Out[45]:
0.999133925541004
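Since all three models report near-identical accuracy, a more informative comparison is their sensitivity on the held-out set. A minimal sketch, assuming the three prediction arrays above are still in scope (all three splits used test_size=0.3 and random_state=123 on the same data, so the test labels coincide):

from sklearn.metrics import recall_score

# Sensitivity (recall on the fraud class) for each of the three models
for name, y_true, y_pred in [('Random forest', y_test, prediction),
                             ('Logistic regression', y1_test, prediction2),
                             ('Decision tree', y2_test, prediction3)]:
    print(name, recall_score(y_true, y_pred))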