EDA and Similarity of Transactions On CreditCardFraudDetection
CreditCardFraudDetection
0.2.2 All the remaining details regarding the data set can be found at the link below.
0.2.3 CreditCardFraud
In [3]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
%matplotlib inline
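The cell that loads the data is not shown here. A minimal sketch of that step, assuming the dataset is a local CSV file with a 0/1 `Class` column, could look like the following. The helper name `load_transactions` and the file name `creditcard.csv` are hypothetical and are not taken from the notebook.

```python
import pandas as pd


def load_transactions(csv_path):
    """Read the transactions CSV and check that the label column is present."""
    frame = pd.read_csv(csv_path)
    if "Class" not in frame.columns:
        raise ValueError("expected a 'Class' column with 0/1 labels")
    return frame


# Hypothetical usage; the file name is an assumption, not the notebook's path:
# data = load_transactions("creditcard.csv")
```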
In [5]: data.shape
In [6]: data.head()
Out[6]: Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
[5 rows x 31 columns]
Class label:
Out[7]: 0 284315
1 492
Name: Class, dtype: int64
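The counts reported in Out[7] (284315 normal, 492 fraud) mean that fraud makes up roughly 0.17% of all transactions, which agrees with the mean of `Class` in the summary statistics. A small sketch of how the shares could be computed is given below; the helper name `class_shares` is an assumption, and the label column is rebuilt from the reported counts rather than read from the real file.

```python
import pandas as pd


def class_shares(labels):
    """Return the count and percentage share of each class label."""
    counts = labels.value_counts().sort_index()
    percent = 100.0 * counts / counts.sum()
    return pd.DataFrame({"count": counts, "percent": percent})


# Rebuild the label column from the counts reported in Out[7].
labels = pd.Series([0] * 284315 + [1] * 492, name="Class")
summary = class_shares(labels)
print(summary)
```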
In [8]: # statistics
data.describe()
Out[8]: Time V1 V2 V3 V4 \
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 94813.859575 3.919560e-15 5.688174e-16 -8.769071e-15 2.782312e-15
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01
V5 V6 V7 V8 V9 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean -1.552563e-15 2.010663e-15 -1.694249e-15 -1.927028e-16 -3.137024e-15
std 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00
min -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25% -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50% -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02
75% 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01
max 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01
Class
count 284807.000000
mean 0.001727
std 0.041527
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]
In [9]: # Plain scatter plot of Amount against Class
data.plot(kind='scatter', x='Amount', y='Class', title='Amount versus transaction type')
plt.show()
0.2.4 Observation:
1) The scatter plot above shows that most transaction amounts lie between 0 and 2500 for
both normal and fraud transactions.
In [11]: # Divide the dataset by label into fraud and normal transactions.
# Fraud means Class = 1 and normal means Class = 0.
fraud = data.loc[data["Class"] == 1]
normal = data.loc[data["Class"] == 0]
In [12]: plt.figure(figsize=(10,5))
plt.subplot(121)
fraud.Amount.plot.hist(title="Histogram of Fraud transactions")
plt.subplot(122)
normal.Amount.plot.hist(title="Histogram of Normal transactions")
In [13]: print("Summary Statistics of fraud transactions:")
fraud.describe().Amount
Summary Statistics of fraud transactions:
0.2.5 Observation:
From the above plots and statistics we can see that the average amount of fraud transactions
is higher than that of normal transactions, although the largest absolute amounts occur among
normal transactions. Based on this, we cannot simply set a condition on the amount to detect
a fraud transaction.
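One way to put the two groups side by side is to summarize `Amount` per `Class` value in a single table. The sketch below does this; the helper name `amount_summary_by_class` is an assumption, and the toy frame is made-up data chosen only to mimic the pattern described (higher fraud mean, higher normal maximum), not the real figures.

```python
import pandas as pd


def amount_summary_by_class(frame):
    """Summary statistics of Amount, with one row per Class value."""
    return frame.groupby("Class")["Amount"].describe()


# Made-up amounts for illustration only; these are not values from the dataset.
toy = pd.DataFrame({
    "Amount": [1.0, 2.0, 3.0, 4.0, 500.0, 100.0, 150.0, 200.0],
    "Class": [0, 0, 0, 0, 0, 1, 1, 1],
})
stats = amount_summary_by_class(toy)
print(stats[["mean", "max"]])
```

On the real data the same call would be `amount_summary_by_class(data)`.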
0.2.6 Let us analyze fraud and normal transactions with respect to time. Although each
transaction is different, out of curiosity I am checking the occurrence of fraud transactions
with respect to time over the two days
In [15]: # The dataset contains two days of transactions.
# Feature 'Time' contains the seconds elapsed between each transaction and the first
# transaction in the dataset. Let us convert the time in seconds to the hour of the day.
dataSubset = data[['Time', 'Amount', 'Class']].copy()
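The cells that perform the conversion are not shown. A minimal sketch of one way to derive an hour of the day from `Time` follows. It assumes the first transaction falls at the start of an hour at midnight, which the dataset does not guarantee; the column name `Hour` and the helper `add_hour_column` are hypothetical.

```python
import pandas as pd


def add_hour_column(frame):
    """Return a copy with an 'Hour' column derived from 'Time' in seconds."""
    out = frame.copy()
    # 3600 seconds per hour; modulo 24 folds the second day onto the first.
    out["Hour"] = (out["Time"] // 3600).astype(int) % 24
    return out


# Toy times in seconds; 172792 is the maximum 'Time' reported by describe().
toy = pd.DataFrame({"Time": [0.0, 3599.0, 3600.0, 86400.0, 172792.0]})
hours = add_hour_column(toy)["Hour"].tolist()
print(hours)  # [0, 0, 1, 0, 23]
```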
In [19]: # Divide the subset by label into fraud and normal transactions.
# Fraud means Class = 1 and normal means Class = 0.
frauddata = dataSubset.loc[dataSubset["Class"] == 1]
normaldata = dataSubset.loc[dataSubset["Class"] == 0]
In [20]: frauddata.describe()
0.2.7 Observation:
1) During the early hours (2 to 3 AM), fraud transactions are more frequent relative to
normal transactions, so fraud may be more likely to occur during that time.
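A comparison like this can be made concrete by computing the fraction of fraud transactions within each hour. The sketch below assumes a frame that already has an hour column named `Hour` (a hypothetical name) next to `Class`; the toy values are invented for illustration and are not results from the dataset.

```python
import pandas as pd


def fraud_share_by_hour(frame):
    """Fraction of transactions labeled fraud (Class == 1) in each hour."""
    # Class is 0/1, so the mean within each group is the fraud fraction.
    return frame.groupby("Hour")["Class"].mean()


# Invented example: one fraud out of four transactions at hour 2, none at hour 14.
toy = pd.DataFrame({
    "Hour": [2, 2, 2, 2, 14, 14, 14, 14],
    "Class": [1, 0, 0, 0, 0, 0, 0, 0],
})
share = fraud_share_by_hour(toy)
print(share)
```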
0.2.8 Observation:
1) Almost all the features are uncorrelated.
2) For every transaction in the sample, find the top 10 transactions in the dataset which
have the lowest similarity(i, j).
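The statement about uncorrelated features can be checked numerically by looking at the largest absolute off-diagonal entry of the Pearson correlation matrix. This is a sketch under that assumption; the helper name `max_abs_offdiag_corr` is hypothetical, and the toy columns are constructed to be exactly uncorrelated.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(frame):
    """Largest absolute Pearson correlation between two different columns."""
    corr = frame.corr().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)  # ignore each column's correlation with itself
    return float(np.abs(corr).max())


# Two toy columns built so that their correlation is exactly zero.
toy = pd.DataFrame({"V1": [1.0, -1.0, 1.0, -1.0], "V2": [1.0, 1.0, -1.0, -1.0]})
print(max_abs_offdiag_corr(toy))
```

A value close to 0 supports the observation, while a value close to 1 would point to at least one strongly correlated pair of columns.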
In [22]: data.index.name = 'TransactionId'
# Take about 100 samples from the dataset using train_test_split,
# aiming to keep the class distribution of the original dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.loc[:, data.columns != 'Class'],
data['Class'], test_size=0.00035, random_state=42)
sample = pd.concat([X_test, y_test], axis=1)
sample.shape
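Note that `train_test_split` only keeps class proportions when the `stratify` argument is passed; without it the split is a plain random draw. The sketch below shows a stratified variant on toy data. It is an alternative, not what the cell above does; the helper name `stratified_sample` and the toy frame are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(frame, n_rows, label="Class", seed=42):
    """Draw n_rows rows whose label proportions match the full frame."""
    _, picked_rows = train_test_split(
        frame, test_size=n_rows, stratify=frame[label], random_state=seed
    )
    return picked_rows


# Toy frame with 2% positives, so a 100-row stratified sample holds 2 of them.
toy = pd.DataFrame({"Amount": range(1000), "Class": [1] * 20 + [0] * 980})
picked = stratified_sample(toy, 100)
print(picked["Class"].value_counts())
```

With about 0.17% fraud in the real data, a 100-row sample contains fewer than one fraud row on average however it is drawn, so a sample with only Class = 0 rows is to be expected.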
Out[24]: Time V1 V2 V3 V4 V5 \
TransactionId
43428 41505.0 -16.526507 8.584972 -18.649853 9.505594 -13.793819
49906 44261.0 0.339812 -2.743745 -0.134070 -1.385729 -1.451413
29474 35484.0 1.399590 -0.590701 0.168619 -1.029950 -0.539806
276481 167123.0 -0.432071 1.647895 -1.669361 -0.349504 0.785785
278846 168473.0 2.014160 -0.137394 -1.015839 0.327269 -0.182179
V6 V7 V8 V9 ... V21 \
TransactionId ...
43428 -2.832404 -16.701694 7.517344 -8.507059 ... 1.190739
49906 1.015887 -0.524379 0.224060 0.899746 ... -0.213436
29474 0.040444 -0.712567 0.002299 -0.971747 ... 0.102398
276481 -0.630647 0.276990 0.586025 -0.484715 ... 0.358932
278846 -0.956571 0.043241 -0.160746 0.363241 ... -0.238644
276481 0.873663 -0.178642 -0.017171 -0.207392 -0.157756 -0.237386
278846 -0.616400 0.347045 0.061561 -0.360196 0.174730 -0.078043
[5 rows x 31 columns]
def printSimilarity(key, similarity, data):
    # key is a (transactionId, Class) pair
    x = key[1]
    s = similarity
    y = key[0]
    print("Class = " + '{0:.5g}'.format(x) + ", Similarity = " +
          '{:f}'.format(s) + ", transactionId = " + '{:d}'.format(y) + "\n")

if dict :
    # Keep the 10 keys with the lowest similarity values
    sorted_dict = sorted(dict, key=dict.__getitem__)[:10]
    printResult(transaction1,sorted_dict,data,dict)
    dict.clear()
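The cells that compute similarity(i, j) are not shown. Since `cosine_similarity` is imported at the top, the sketch below assumes cosine similarity over the feature columns, which may differ from the measure actually used to produce the output that follows. The helper name `lowest_similarity` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity


def lowest_similarity(sample, data, feature_cols, k=10):
    """For each sample row, list the k data rows with the lowest cosine similarity."""
    sims = cosine_similarity(sample[feature_cols], data[feature_cols])
    result = {}
    for row, sample_id in enumerate(sample.index):
        order = np.argsort(sims[row])[:k]  # ascending order: least similar first
        result[sample_id] = [(data.index[j], float(sims[row, j])) for j in order]
    return result


# Toy frame: row 2 points in the opposite direction to row 0, row 1 is orthogonal.
toy = pd.DataFrame({"V1": [1.0, 0.0, -1.0, 1.0], "V2": [0.0, 1.0, 0.0, 1.0]})
toy.index.name = "TransactionId"
nearest_last = lowest_similarity(toy.iloc[[0]], toy, ["V1", "V2"], k=2)
print(nearest_last)
```

Computing the full similarity matrix in one call avoids a Python loop over every pair, at the cost of holding a matrix of shape (rows in the sample, rows in the dataset) in memory.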
--------------------------------------------------------
Class = 0, Similarity = 0.080592, transactionId = 20
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028536, transactionId = 4
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.009565, transactionId = 3
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.000064, transactionId = 0
--------------------------------------------------------
--------------------------------------------------------
Similar transactions are :
--------------------------------------------------------
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.069861, transactionId = 20
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028536, transactionId = 4
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.025668, transactionId = 51
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.004868, transactionId = 2
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.000238, transactionId = 1
--------------------------------------------------------
--------------------------------------------------------
For the transaction id = 283656, and Class = 0
--------------------------------------------------------
Class = 0, Similarity = 0.074950, transactionId = 8
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.052348, transactionId = 89
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028086, transactionId = 164
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.009255, transactionId = 3
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.000019, transactionId = 0
--------------------------------------------------------
--------------------------------------------------------
Similar transactions are :
--------------------------------------------------------
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.069609, transactionId = 20
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028711, transactionId = 4
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.025698, transactionId = 51
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.003947, transactionId = 2
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.025909, transactionId = 1
--------------------------------------------------------
--------------------------------------------------------
For the transaction id = 114027, and Class = 0
--------------------------------------------------------
Class = 0, Similarity = 0.074866, transactionId = 8
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.051928, transactionId = 89
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028543, transactionId = 4
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.008103, transactionId = 3
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.000274, transactionId = 0
--------------------------------------------------------
--------------------------------------------------------
Similar transactions are :
--------------------------------------------------------
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.069476, transactionId = 20
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028551, transactionId = 4
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.025880, transactionId = 51
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.002700, transactionId = 2
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.000174, transactionId = 1
--------------------------------------------------------
--------------------------------------------------------
For the transaction id = 210194, and Class = 0
--------------------------------------------------------
Class = 0, Similarity = 0.074867, transactionId = 8
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.051612, transactionId = 89
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028217, transactionId = 164
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.008178, transactionId = 3
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.001191, transactionId = 0
--------------------------------------------------------
--------------------------------------------------------
Similar transactions are :
--------------------------------------------------------
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.078783, transactionId = 20
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.028938, transactionId = 4
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.026499, transactionId = 51
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.002654, transactionId = 2
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.000163, transactionId = 1
--------------------------------------------------------
--------------------------------------------------------
For the transaction id = 188412, and Class = 0
--------------------------------------------------------
Class = 0, Similarity = 0.075031, transactionId = 8
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.051941, transactionId = 89
--------------------------------------------------------
--------------------------------------------------------
Class = 0, Similarity = 0.026998, transactionId = 164
--------------------------------------------------------