Credit Card Fraud Detection: Data Preprocessing Notebook
In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
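The cell that loads the dataset (In [2]) is not shown in this export. A minimal sketch, assuming the standard Kaggle creditcard.csv file in the working directory (the file name and location are assumptions):
# Load the credit card transactions dataset (file name assumed)
data = pd.read_csv('creditcard.csv')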
In [3]:
# Display the first few rows of the dataset
data.head()
Out[3]: first five rows of the dataset; 5 rows × 31 columns (Time, V1–V28, Amount, Class).
In [4]:
# Summary of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
In [5]:
# Statistical overview of the dataset
data.describe()
Out[5]: summary statistics (count, mean, std, min, quartiles, max) for every column; 8 rows × 31 columns.
In [6]:
data.isnull().sum()
Out[6]: Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
In [7]:
# Class distribution
data['Class'].value_counts()
Out[7]: Class
0 284315
1 492
Name: count, dtype: int64
In [8]:
(data.groupby('Class')['Class'].count()/data['Class'].count()) * 100
Out[8]: Class
0 99.827251
1 0.172749
Name: Class, dtype: float64
In [9]:
data.dtypes
The dataset consists of 284,807 transactions with 31 features, including 'Time', 'Amount', and 28
anonymized features labeled V1 to V28. The target variable 'Class' indicates whether a transaction is
fraudulent (1) or not (0). There are no missing values in the dataset.
Exploratory Data Analysis (EDA)
In [10]:
# Plotting class distribution
sns.countplot(x='Class', data=data)
plt.title('Class Distribution')
plt.show()
The dataset is highly imbalanced, with the majority of transactions being non-fraudulent (Class 0).
Fraudulent transactions (Class 1) make up only 0.17% of the dataset.
In [11]:
# Plotting correlation heatmap
corr = data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.title('Correlation Heatmap')
plt.show()
The correlation heatmap indicates that there are no extremely strong correlations among the features,
suggesting that multicollinearity is not a significant concern.
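To back this up numerically, the features most correlated with the target can be listed; a small sketch (not part of the original notebook), reusing the corr matrix computed above:
# Rank features by absolute correlation with the Class column
corr_with_class = corr['Class'].drop('Class').abs().sort_values(ascending=False)
print(corr_with_class.head(10))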
In [12]:
# Pie chart for class distribution
classes = data['Class'].value_counts()
normal_share = classes[0] / data['Class'].count() * 100
fraud_share = classes[1] / data['Class'].count() * 100
plt.pie([normal_share, fraud_share], labels=['Non-Fraudulent (0)', 'Fraudulent (1)'], autopct='%1.2f%%')
plt.title('Share of Fraudulent vs Non-Fraudulent Transactions')
plt.show()
The pie chart further illustrates the significant imbalance between fraudulent and non-fraudulent
transactions, with fraudulent transactions accounting for only 0.17% of the total transactions.
In [13]:
print('The percentage of non-fraudulent transactions is ', round(data['Class'].value_counts()[0]/len(data) * 100, 2))
print('The percentage of fraudulent transactions is ', round(data['Class'].value_counts()[1]/len(data) * 100, 2))
print('The ratio of imbalance is', round(data['Class'].value_counts()[1]/data['Class'].value_counts()[0], 4))
In [14]:
# Scatter plot for Time vs Class
Delta_Time = pd.to_timedelta(data['Time'], unit='s')
data['Time_Day'] = (Delta_Time.dt.components.days).astype(int)
data['Time_Hour'] = (Delta_Time.dt.components.hours).astype(int)
data['Time_Min'] = (Delta_Time.dt.components.minutes).astype(int)
fig = plt.figure(figsize=(14, 18))
cmap = sns.color_palette('Set2')
plt.subplot(3, 1, 1)
sns.scatterplot(x=data['Time'], y='Class', palette=cmap, data=data)
plt.xlabel('Time', size=18)
plt.ylabel('Class', size=18)
plt.tick_params(axis='x', labelsize=16)
plt.tick_params(axis='y', labelsize=16)
plt.title('Time vs Class Distribution', size=20, y=1.05)
plt.show()
The scatter plot of 'Time' vs 'Class' does not show any distinct pattern, indicating that fraud occurs fairly uniformly over the period captured by the 'Time' feature.
In [15]:
# Fraud Vs Normal transactions by day
plt.figure(figsize=(5, 5))
sns.distplot(data[data['Class'] == 0]["Time_Day"], color='green')
sns.distplot(data[data['Class'] == 1]["Time_Day"], color='red')
plt.title('Fraud Vs Normal Transactions by Day', fontsize=17)
plt.show()
The distribution of fraudulent and non-fraudulent transactions by day appears similar, suggesting that
fraud occurs consistently across different days.
In [16]:
# Fraud Vs Normal transactions by hour
plt.figure(figsize=(15, 5))
sns.distplot(data[data['Class'] == 0]["Time_Hour"], color='green')
sns.distplot(data[data['Class'] == 1]["Time_Hour"], color='red')
plt.title('Fraud Vs Normal Transactions by Hour', fontsize=17)
plt.show()
The hourly distribution of fraudulent and non-fraudulent transactions shows that fraud is more
prevalent during certain hours of the day.
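One way to quantify this observation (not in the original notebook) is to count fraudulent transactions per hour bucket directly:
# Number of fraudulent transactions recorded in each hour-of-day bucket
fraud_by_hour = data.loc[data['Class'] == 1, 'Time_Hour'].value_counts().sort_index()
print(fraud_by_hour)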
In [17]:
# Scatter plot for Amount vs Class
df_Fraud = data[data['Class'] == 1]
df_Regular = data[data['Class'] == 0]
In [18]:
# Fraud Transaction Amount Statistics
print(df_Fraud["Amount"].describe())
count 492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
In [19]:
# Regular Transaction Amount Statistics
print(df_Regular["Amount"].describe())
count 284315.000000
mean 88.291022
std 250.105092
min 0.000000
25% 5.650000
50% 22.000000
75% 77.050000
max 25691.160000
Name: Amount, dtype: float64
In [20]:
fig = plt.figure(figsize=(14, 18))
cmap = sns.color_palette('Set1')
plt.subplot(3, 1, 1)
sns.scatterplot(x=data['Amount'], y='Class', palette=cmap, data=data)
plt.xlabel('Amount ($)', size=18)
plt.ylabel('Class', size=18)
plt.tick_params(axis='x', labelsize=16)
plt.tick_params(axis='y', labelsize=16)
plt.title('Amount vs Class Distribution', size=20, y=1.05)
plt.show()
Fraudulent transactions have a higher mean amount (about 122 vs. 88 for regular transactions), but their median amount is lower (9.25 vs. 22.00) and their maximum (about 2,126) is far below that of regular transactions. The scatter plot and the summary statistics above show that fraud occurs across a wide range of amounts, with most fraudulent amounts being relatively small.
In [21]:
# Boxplot for numerical attributes
numeric_features = data.select_dtypes(include=[np.number]).columns.tolist()
li_not_plot = ['Class', 'Time']
li_transform_num_feats = [c for c in list(numeric_features) if c not in li_not_plot]
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(22, 34))
ax = sns.boxplot(data=data[li_transform_num_feats], orient="h", palette="Paired")
ax.set(ylabel="Features")
ax.set(xlabel="Values")
ax.set(title="Distribution of numerical attributes")
sns.despine(trim=True, left=True)
The boxplot indicates the presence of outliers in several numerical features. This suggests the need for
careful handling of these outliers during model training.
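As an illustration of one such approach (not part of the original notebook), outliers could be capped with interquartile-range limits; a minimal sketch on the 'Amount' column only:
# Hypothetical IQR-based capping, shown for 'Amount' as an example
q1, q3 = data['Amount'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print('Values outside the IQR fences:', ((data['Amount'] < lower) | (data['Amount'] > upper)).sum())
amount_capped = data['Amount'].clip(lower=lower, upper=upper)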
Data Cleaning
In [22]:
# Drop unnecessary columns
data.drop('Time', axis=1, inplace=True)
data.drop(['Time_Day', 'Time_Min'], axis=1, inplace=True)
In [23]:
# Amount distribution
plt.figure(figsize=(24, 12))
plt.subplot(2, 2, 1)
plt.title('Amount Distribution')
data['Amount'].astype(int).plot.hist()
plt.xlabel("variable Amount")
plt.subplot(2, 2, 2)
plt.title('Amount Distribution')
sns.set()
plt.xlabel("variable Amount")
plt.hist(data['Amount'], bins=100)
plt.show()
In [24]:
data = data.drop_duplicates()
Duplicate entries were found and removed from the dataset. In the next step, the 'Amount' feature is scaled to have a mean of 0 and a standard deviation of 1, making it suitable for model training.
Scaling Features:
In [25]:
# Scale 'Amount' feature
scaler_amount = StandardScaler()
data['Amount'] = scaler_amount.fit_transform(pd.DataFrame(data['Amount']))
Feature scaling was applied to the 'Amount' feature. No additional feature engineering was performed
as the dataset consists of anonymized features.
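A quick sanity check (not in the original notebook) confirms the effect of the scaler:
# After StandardScaler, the column should have mean ~0 and standard deviation ~1
print(data['Amount'].mean(), data['Amount'].std())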
In [27]:
# Balance the data with oversampling
def oversample_data(X, y):
    smote = SMOTE(random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    return X_res, y_res
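The next cell also calls undersample_data, whose definition does not appear in this export. A minimal sketch, consistent with the random majority-class undersampling described below and using the resample utility imported at the top of the notebook (the original implementation may differ):
def undersample_data(df):
    # Split majority (Class 0) and minority (Class 1) records
    df_majority = df[df['Class'] == 0]
    df_minority = df[df['Class'] == 1]
    # Randomly downsample the majority class to the size of the minority class
    df_majority_down = resample(df_majority, replace=False,
                                n_samples=len(df_minority), random_state=42)
    df_balanced = pd.concat([df_majority_down, df_minority])
    return df_balanced.drop('Class', axis=1), df_balanced['Class']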
In [28]:
# Process data for undersampling
X_res_undersample, y_res_undersample = undersample_data(data)
# Process data for oversampling
X = data.drop('Class', axis=1)
y = data['Class']
X_res_oversample, y_res_oversample = oversample_data(X, y)
SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training data to address the
class imbalance. This technique generates synthetic samples for the minority class, resulting in a
balanced dataset. Additionally, undersampling was performed by randomly reducing the number of
majority class samples to match the number of minority class samples.
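A quick check (not in the original notebook) of the class balance after resampling:
# Both resampled sets should now contain an equal number of records per class
print(y_res_oversample.value_counts())
print(y_res_undersample.value_counts())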
In [29]:
# Plot the histogram of variables to see the skewness
cols = list(X.columns.values)
normal_records = data.Class == 0
fraud_records = data.Class == 1
plt.figure(figsize=(20, 60))
for n, col in enumerate(cols):
plt.subplot(10, 3, n + 1)
sns.distplot(X[col][normal_records], color='green')
sns.distplot(X[col][fraud_records], color='red')
plt.title(col, fontsize=17)
plt.show()
Train/Test Split
In [30]:
# Train/Test split
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(
    X_res_undersample, y_res_undersample, test_size=0.2, random_state=42)
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(
    X_res_oversample, y_res_oversample, test_size=0.2, random_state=42)
The data was split into training and testing sets with a ratio of 80:20. This split ensures that we have a
sufficient amount of data for training the model while retaining enough data to evaluate the model's
performance.
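To confirm the 80:20 ratio, the shapes of the resulting sets can be printed (a sketch, not in the original notebook):
# Roughly 80% of the resampled rows should land in each training set
print(X_train_over.shape, X_test_over.shape)
print(X_train_under.shape, X_test_under.shape)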
In [31]:
# Create a dictionary with the data to save
data_dict = {
'X_train_under': X_train_under,
'X_test_under': X_test_under,
'y_train_under': y_train_under,
'y_test_under': y_test_under,
'X_train_over': X_train_over,
'X_test_over': X_test_over,
'y_train_over': y_train_over,
'y_test_over': y_test_over
}
In [32]:
# Define the base path
base_path = r'C:\Users\shubh\Downloads\upgrad assets\credit card project\credit-card-fraud-d
In [33]:
# Loop through the data dictionary and save each DataFrame to a CSV file
for name, df in data_dict.items():
file_path = f"{base_path}\\{name}.csv"
df.to_csv(file_path, index=False)
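These CSVs can later be read back for the modelling notebooks; a brief sketch of reloading one saved split (file names as written by the loop above):
# Reload a saved split; squeeze turns the single-column target frame back into a Series
X_train_over_loaded = pd.read_csv(f"{base_path}\\X_train_over.csv")
y_train_over_loaded = pd.read_csv(f"{base_path}\\y_train_over.csv").squeeze("columns")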
In [ ]: