Credit Card Fraud Detection: Data Preprocessing Notebook
In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
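The cell that loads the dataset (In [2]) is not shown in this export. A minimal sketch, assuming the standard Kaggle creditcard.csv file in the working directory (the file name and location are assumptions):
# Load the credit card transactions dataset (file name assumed)
data = pd.read_csv('creditcard.csv')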
In [3]:
# Display the first few rows of the dataset
data.head()
Out[3]: first five rows of the dataset; 5 rows × 31 columns (Time, V1–V28, Amount, Class).
In [4]:
# Summary of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
In [5]:
# Statistical overview of the dataset
data.describe()
Out[5]: summary statistics (count, mean, std, min, quartiles, max) for every column; 8 rows × 31 columns.
In [6]:
data.isnull().sum()
Out[6]: Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
In [7]:
# Class distribution
data['Class'].value_counts()
Out[7]: Class
0 284315
1 492
Name: count, dtype: int64
In [8]:
(data.groupby('Class')['Class'].count()/data['Class'].count()) * 100
Out[8]: Class
0 99.827251
1 0.172749
Name: Class, dtype: float64
In [9]:
data.dtypes
The dataset consists of 284,807 transactions with 31 features, including 'Time', 'Amount', and 28
anonymized features labeled V1 to V28. The target variable 'Class' indicates whether a transaction is
fraudulent (1) or not (0). There are no missing values in the dataset.
Exploratory Data Analysis (EDA)
In [10]:
# Plotting class distribution
sns.countplot(x='Class', data=data)
plt.title('Class Distribution')
plt.show()
The dataset is highly imbalanced, with the majority of transactions being non-fraudulent (Class 0).
Fraudulent transactions (Class 1) make up only 0.17% of the dataset.
In [11]:
# Plotting correlation heatmap
corr = data.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.title('Correlation Heatmap')
plt.show()
The correlation heatmap indicates that there are no extremely strong correlations among the features,
suggesting that multicollinearity is not a significant concern.
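To back this up numerically, the features most correlated with the target can be listed; a small sketch (not part of the original notebook), reusing the corr matrix computed above:
# Rank features by absolute correlation with the Class column
corr_with_class = corr['Class'].drop('Class').abs().sort_values(ascending=False)
print(corr_with_class.head(10))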
In [12]:
# Pie chart for class distribution
classes = data['Class'].value_counts()
normal_share = classes[0] / data['Class'].count() * 100
fraud_share = classes[1] / data['Class'].count() * 100
plt.pie([normal_share, fraud_share], labels=['Non-Fraudulent (0)', 'Fraudulent (1)'], autopct='%1.2f%%')
plt.title('Share of Fraudulent vs Non-Fraudulent Transactions')
plt.show()
The pie chart further illustrates the significant imbalance between fraudulent and non-fraudulent
transactions, with fraudulent transactions accounting for only 0.17% of the total transactions.
In [13]:
print('The percentage of non-fraudulent transactions is ', round(data['Class'].value_counts()[0]/len(data) * 100, 2))
print('The percentage of fraudulent transactions is ', round(data['Class'].value_counts()[1]/len(data) * 100, 2))
print('The ratio of imbalance is', round(data['Class'].value_counts()[1]/data['Class'].value_counts()[0], 4))
In [14]:
# Scatter plot for Time vs Class
Delta_Time = pd.to_timedelta(data['Time'], unit='s')
data['Time_Day'] = (Delta_Time.dt.components.days).astype(int)
data['Time_Hour'] = (Delta_Time.dt.components.hours).astype(int)
data['Time_Min'] = (Delta_Time.dt.components.minutes).astype(int)
fig = plt.figure(figsize=(14, 18))
cmap = sns.color_palette('Set2')
plt.subplot(3, 1, 1)
sns.scatterplot(x=data['Time'], y='Class', palette=cmap, data=data)
plt.xlabel('Time', size=18)
plt.ylabel('Class', size=18)
plt.tick_params(axis='x', labelsize=16)
plt.tick_params(axis='y', labelsize=16)
plt.title('Time vs Class Distribution', size=20, y=1.05)
plt.show()
The scatter plot of 'Time' vs 'Class' does not show any distinct pattern, indicating that fraud occurs fairly uniformly over the period captured by the 'Time' feature.
In [15]:
# Fraud Vs Normal transactions by day
plt.figure(figsize=(5, 5))
sns.distplot(data[data['Class'] == 0]["Time_Day"], color='green')
sns.distplot(data[data['Class'] == 1]["Time_Day"], color='red')
plt.title('Fraud Vs Normal Transactions by Day', fontsize=17)
plt.show()
The distribution of fraudulent and non-fraudulent transactions by day appears similar, suggesting that
fraud occurs consistently across different days.
In [16]:
# Fraud Vs Normal transactions by hour
plt.figure(figsize=(15, 5))
sns.distplot(data[data['Class'] == 0]["Time_Hour"], color='green')
sns.distplot(data[data['Class'] == 1]["Time_Hour"], color='red')
plt.title('Fraud Vs Normal Transactions by Hour', fontsize=17)
plt.show()
The hourly distribution of fraudulent and non-fraudulent transactions shows that fraud is more
prevalent during certain hours of the day.
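One way to quantify this observation (not in the original notebook) is to count fraudulent transactions per hour bucket directly:
# Number of fraudulent transactions recorded in each hour-of-day bucket
fraud_by_hour = data.loc[data['Class'] == 1, 'Time_Hour'].value_counts().sort_index()
print(fraud_by_hour)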
In [17]:
# Scatter plot for Amount vs Class
df_Fraud = data[data['Class'] == 1]
df_Regular = data[data['Class'] == 0]
In [18]:
# Fraud Transaction Amount Statistics
print(df_Fraud["Amount"].describe())
count 492.000000
mean 122.211321
std 256.683288
min 0.000000
25% 1.000000
50% 9.250000
75% 105.890000
max 2125.870000
Name: Amount, dtype: float64
In [19]:
# Regular Transaction Amount Statistics
print(df_Regular["Amount"].describe())
count 284315.000000
mean 88.291022
std 250.105092
min 0.000000
25% 5.650000
50% 22.000000
75% 77.050000
max 25691.160000
Name: Amount, dtype: float64
In [20]:
fig = plt.figure(figsize=(14, 18))
cmap = sns.color_palette('Set1')
plt.subplot(3, 1, 1)
sns.scatterplot(x=data['Amount'], y='Class', palette=cmap, data=data)
plt.xlabel('Amount ($)', size=18)
plt.ylabel('Class', size=18)
plt.tick_params(axis='x', labelsize=16)
plt.tick_params(axis='y', labelsize=16)
plt.title('Amount vs Class Distribution', size=20, y=1.05)
plt.show()
Fraudulent transactions have a higher mean amount (about 122 vs. 88 for regular transactions), but their median amount is lower (9.25 vs. 22.00) and their maximum (about 2,126) is far below that of regular transactions. The scatter plot and the summary statistics above show that fraud occurs across a wide range of amounts, with most fraudulent amounts being relatively small.
In [21]:
# Boxplot for numerical attributes
numeric_features = data.select_dtypes(include=[np.number]).columns.tolist()
li_not_plot = ['Class', 'Time']
li_transform_num_feats = [c for c in list(numeric_features) if c not in li_not_plot]
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(22, 34))
ax = sns.boxplot(data=data[li_transform_num_feats], orient="h", palette="Paired")
ax.set(ylabel="Features")
ax.set(xlabel="Values")
ax.set(title="Distribution of numerical attributes")
sns.despine(trim=True, left=True)
The boxplot indicates the presence of outliers in several numerical features. This suggests the need for
careful handling of these outliers during model training.
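As an illustration of one such approach (not part of the original notebook), outliers could be capped with interquartile-range limits; a minimal sketch on the 'Amount' column only:
# Hypothetical IQR-based capping, shown for 'Amount' as an example
q1, q3 = data['Amount'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print('Values outside the IQR fences:', ((data['Amount'] < lower) | (data['Amount'] > upper)).sum())
amount_capped = data['Amount'].clip(lower=lower, upper=upper)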
Data Cleaning
In [22]:
# Drop unnecessary columns
data.drop('Time', axis=1, inplace=True)
data.drop(['Time_Day', 'Time_Min'], axis=1, inplace=True)
In [23]:
# Amount distribution
plt.figure(figsize=(24, 12))
plt.subplot(2, 2, 1)
plt.title('Amount Distribution')
data['Amount'].astype(int).plot.hist()
plt.xlabel("variable Amount")
plt.subplot(2, 2, 2)
plt.title('Amount Distribution')
sns.set()
plt.xlabel("variable Amount")
plt.hist(data['Amount'], bins=100)
plt.show()
In [24]:
data = data.drop_duplicates()
Duplicate entries were found and removed from the dataset. In the next step, the 'Amount' feature is scaled to have a mean of 0 and a standard deviation of 1, making it suitable for model training.
Scaling Features:
In [25]:
# Scale 'Amount' feature
scaler_amount = StandardScaler()
data['Amount'] = scaler_amount.fit_transform(pd.DataFrame(data['Amount']))
Feature scaling was applied to the 'Amount' feature. No additional feature engineering was performed
as the dataset consists of anonymized features.
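A quick sanity check (not in the original notebook) confirms the effect of the scaler:
# After StandardScaler, the column should have mean ~0 and standard deviation ~1
print(data['Amount'].mean(), data['Amount'].std())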
In [27]:
# Balance the data with oversampling
def oversample_data(X, y):
    smote = SMOTE(random_state=42)
    X_res, y_res = smote.fit_resample(X, y)
    return X_res, y_res
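The next cell also calls undersample_data, whose definition does not appear in this export. A minimal sketch, consistent with the random majority-class undersampling described below and using the resample utility imported at the top of the notebook (the original implementation may differ):
def undersample_data(df):
    # Split majority (Class 0) and minority (Class 1) records
    df_majority = df[df['Class'] == 0]
    df_minority = df[df['Class'] == 1]
    # Randomly downsample the majority class to the size of the minority class
    df_majority_down = resample(df_majority, replace=False,
                                n_samples=len(df_minority), random_state=42)
    df_balanced = pd.concat([df_majority_down, df_minority])
    return df_balanced.drop('Class', axis=1), df_balanced['Class']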
In [28]:
# Process data for undersampling
X_res_undersample, y_res_undersample = undersample_data(data)
# Process data for oversampling
X = data.drop('Class', axis=1)
y = data['Class']
X_res_oversample, y_res_oversample = oversample_data(X, y)
SMOTE (Synthetic Minority Over-sampling Technique) was applied to the training data to address the
class imbalance. This technique generates synthetic samples for the minority class, resulting in a
balanced dataset. Additionally, undersampling was performed by randomly reducing the number of
majority class samples to match the number of minority class samples.
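A quick check (not in the original notebook) of the class balance after resampling:
# Both resampled sets should now contain an equal number of records per class
print(y_res_oversample.value_counts())
print(y_res_undersample.value_counts())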
In [29]:
# Plot the histogram of variables to see the skewness
cols = list(X.columns.values)
normal_records = data.Class == 0
fraud_records = data.Class == 1
plt.figure(figsize=(20, 60))
for n, col in enumerate(cols):
plt.subplot(10, 3, n + 1)
sns.distplot(X[col][normal_records], color='green')
sns.distplot(X[col][fraud_records], color='red')
plt.title(col, fontsize=17)
plt.show()
Train/Test Split
In [30]:
# Train/Test split
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(
    X_res_undersample, y_res_undersample, test_size=0.2, random_state=42)
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(
    X_res_oversample, y_res_oversample, test_size=0.2, random_state=42)
The data was split into training and testing sets with a ratio of 80:20. This split ensures that we have a
sufficient amount of data for training the model while retaining enough data to evaluate the model's
performance.
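To confirm the 80:20 ratio, the shapes of the resulting sets can be printed (a sketch, not in the original notebook):
# Roughly 80% of the resampled rows should land in each training set
print(X_train_over.shape, X_test_over.shape)
print(X_train_under.shape, X_test_under.shape)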
In [31]:
# Create a dictionary with the data to save
data_dict = {
'X_train_under': X_train_under,
'X_test_under': X_test_under,
'y_train_under': y_train_under,
'y_test_under': y_test_under,
'X_train_over': X_train_over,
'X_test_over': X_test_over,
'y_train_over': y_train_over,
'y_test_over': y_test_over
}
In [32]:
# Define the base path
base_path = r'C:\Users\shubh\Downloads\upgrad assets\credit card project\credit-card-fraud-d
In [33]:
# Loop through the data dictionary and save each DataFrame to a CSV file
for name, df in data_dict.items():
file_path = f"{base_path}\\{name}.csv"
df.to_csv(file_path, index=False)
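These CSVs can later be read back for the modelling notebooks; a brief sketch of reloading one saved split (file names as written by the loop above):
# Reload a saved split; squeeze turns the single-column target frame back into a Series
X_train_over_loaded = pd.read_csv(f"{base_path}\\X_train_over.csv")
y_train_over_loaded = pd.read_csv(f"{base_path}\\y_train_over.csv").squeeze("columns")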
In [ ]: