
Data Wrangling and Analysis

Introduction
In the intricate world of financial transactions, fraud poses a constant threat to the
integrity of systems and the trust of stakeholders. As technology advances, so do the
methods of fraudulent actors, necessitating a dynamic and adaptive approach to
detection and prevention. This paper examines the evolving landscape of fraud
detection, exploring modern techniques and collaborative efforts aimed at thwarting
illicit activity. By working at the intersection of data analytics, machine learning,
and regulatory frameworks, we seek to bolster defenses, minimize risk, and uphold the
integrity of financial systems, with innovation and vigilance as the guiding
principles in safeguarding assets and preserving trust in financial transactions.

Objectives:
1. Early Detection: Implement systems capable of identifying fraudulent
activities at the earliest possible stage to minimize financial losses and
mitigate damage.
2. Accuracy: Strive for high accuracy in fraud detection algorithms to reduce
false positives and negatives, ensuring efficient allocation of resources for
investigation and prevention.
3. Adaptability: Develop flexible and adaptive fraud detection systems
capable of evolving alongside emerging fraud schemes and changing
regulatory landscapes.
4. Compliance: Ensure compliance with relevant laws, regulations, and
industry standards governing fraud detection and prevention to mitigate legal
risks and maintain trust.
5. Collaboration: Foster collaboration and information sharing among
financial institutions, regulatory bodies, law enforcement agencies, and
technology providers to enhance the collective ability to detect and combat
fraudulent activities effectively.
Dataset Description
The dataset comprises a collection of financial transactions spanning various types,
such as credit card transactions, wire transfers, and online payments. Each
transaction entry includes relevant features such as transaction amount, timestamp,
merchant information, and customer details. Additionally, the dataset contains
labels indicating whether each transaction is fraudulent or legitimate. With a
diverse range of transaction types and associated attributes, this dataset provides a
rich resource for training and evaluating fraud detection algorithms in real-world
scenarios.

Data Wrangling Techniques

1. Data Cleaning: Identify and handle missing values, outliers, and inconsistencies
in the dataset to ensure data quality and reliability for accurate fraud detection
models.

Code:
import pandas as pd

# Example data
data = {
    'transaction_id': [1, 2, 3, 4, 5, 6],
    'amount': [100, -200, 300, 400, 500, 600],
    'merchant': ['A', 'B', 'C', 'A', 'B', 'C'],
    'transaction_type': ['purchase', 'refund', 'purchase', 'purchase', 'purchase', 'refund'],
    'is_fraud': [0, 1, 0, 0, 0, 1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Output before cleaning
print("Before Data Cleaning:")
print(df)

# Remove transactions with non-positive amounts
df = df[df['amount'] > 0]

# Remove duplicate rows
df = df.drop_duplicates()

# Remove rows with missing values
df = df.dropna()

# Output after cleaning
print("\nAfter Data Cleaning:")
print(df)

Output
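Dropping rows is not the only way to handle missing values; numeric fields are often imputed instead. A minimal sketch, with the median as an illustrative choice (the toy data above happens to contain no missing amounts):

# Impute missing amounts with the column median rather than dropping the rows
df['amount'] = df['amount'].fillna(df['amount'].median())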
2. Feature Engineering: Creating new features: Derived features such as
transaction frequency, transaction amount variability, and time-based features like
day of the week or time of day can provide valuable information for fraud
detection.

Code:
import pandas as pd

# Sample data
data = {
    'transaction_id': [1, 2, 3, 4, 5],
    'amount': [100, 200, 150, 300, 400],
    'merchant': ['A', 'B', 'C', 'A', 'B'],
    'transaction_type': ['purchase', 'purchase', 'refund', 'purchase', 'purchase'],
    'timestamp': ['2024-05-01 08:00:00', '2024-05-01 09:00:00', '2024-05-01 10:00:00',
                  '2024-05-01 11:00:00', '2024-05-01 12:00:00'],
    'is_fraud': [0, 0, 1, 0, 0]
}

# Create DataFrame
df = pd.DataFrame(data)

# Convert timestamp strings to datetime objects
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Feature engineering: time-based and per-merchant features
df['hour_of_day'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['transaction_count_merchant'] = df.groupby('merchant')['transaction_id'].transform('count')
df['transaction_total_merchant'] = df.groupby('merchant')['amount'].transform('sum')

# Print the DataFrame after feature engineering
print("DataFrame after Feature Engineering:")
print(df)

Output
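The section above also lists transaction amount variability, which the sample code does not compute. A minimal sketch using the same df (the column name amount_std_merchant is an illustrative choice):

# Per-merchant variability of transaction amounts (sample standard deviation);
# merchants with a single transaction yield NaN, filled here with 0
df['amount_std_merchant'] = df.groupby('merchant')['amount'].transform('std').fillna(0)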

3. Outlier Detection and Treatment: Identifying outliers: Outliers in transaction
amounts or other features can indicate potentially fraudulent behavior.

Code:

import pandas as pd
import numpy as np

# Sample financial transactions data
data = {
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'amount': [100, 200, 150, 300, 400, 500, 600, 700, 800, 900],
    'is_fraud': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]  # assuming the 6th transaction is fraudulent
}

# Create DataFrame
df = pd.DataFrame(data)

# Detect outliers using the z-score method
threshold = 3
mean = np.mean(df['amount'])
std_dev = np.std(df['amount'])
df['z_score'] = (df['amount'] - mean) / std_dev

# Keep only transactions whose absolute z-score is within the threshold
df_filtered = df[df['z_score'].abs() <= threshold]

# Print DataFrame after outlier treatment
print("DataFrame after Outlier Detection and Treatment:")
print(df_filtered)

Output
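Dropping outlying rows is only one treatment, and in fraud detection it risks discarding the very transactions of interest. A minimal alternative sketch that caps amounts at the conventional 1.5 x IQR fences instead of removing them (the column name amount_capped is illustrative):

# Cap extreme amounts at the IQR fences instead of dropping the rows
q1, q3 = df['amount'].quantile([0.25, 0.75])
iqr = q3 - q1
df['amount_capped'] = df['amount'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)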

4. Data Transformation: Normalization or standardization: Scaling numerical features
to a similar range can improve the performance of certain algorithms, such as
distance-based methods.
Code:
import pandas as pd

# Sample financial transactions data
data = {
    'transaction_id': [1, 2, 3, 4, 5],
    'amount': [100, 200, 150, 300, 400],
    'merchant': ['A', 'B', 'C', 'A', 'B'],
    'transaction_type': ['purchase', 'purchase', 'refund', 'purchase', 'purchase'],
    'is_fraud': [0, 0, 1, 0, 0]
}

# Create DataFrame
df = pd.DataFrame(data)

# Data transformation: min-max normalization of the 'amount' column
df['amount_normalized'] = (df['amount'] - df['amount'].min()) / \
                          (df['amount'].max() - df['amount'].min())

# Print DataFrame after data transformation
print("DataFrame after Data Transformation:")
print(df)

Output
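The code above demonstrates min-max normalization only. For standardization (zero mean, unit variance), a minimal sketch using scikit-learn's StandardScaler (the column name amount_standardized is an illustrative choice):

from sklearn.preprocessing import StandardScaler

# Standardize 'amount' to zero mean and unit variance
scaler = StandardScaler()
df['amount_standardized'] = scaler.fit_transform(df[['amount']]).ravel()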
5. Data Aggregation and Summarization: Aggregating transactions: Grouping
transactions by attributes such as customer ID, merchant, or time period to
calculate summary statistics like total transaction amount, average transaction
amount, etc.

Code:

import pandas as pd

# Sample financial transactions data
data = {
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'amount': [100, 200, 150, 300, 400, 500, 600, 700, 800, 900],
    'merchant': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'is_fraud': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # assuming the 5th transaction is fraudulent
}

# Create DataFrame
df = pd.DataFrame(data)

# Group by 'merchant' and calculate total transaction amount and count
summary_df = df.groupby('merchant').agg(
    {'amount': 'sum', 'transaction_id': 'count'}).reset_index()
summary_df.columns = ['merchant', 'total_transaction_amount', 'transaction_count']

# Print summary DataFrame
print("Summary DataFrame after Data Aggregation and Summarization:")
print(summary_df)

Output
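The section also mentions grouping by time period. A minimal sketch, assuming a datetime 'timestamp' column like the one used in the feature-engineering example (this section's sample df does not include one):

# Daily totals, averages, and counts per merchant
df['timestamp'] = pd.to_datetime(df['timestamp'])
daily_summary = (df.set_index('timestamp')
                   .groupby('merchant')
                   .resample('D')['amount']
                   .agg(['sum', 'mean', 'count'])
                   .reset_index())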

6. Handling Imbalanced Data: Resampling techniques: Addressing class imbalance by
oversampling minority class instances, undersampling majority class instances, or
using more advanced techniques like SMOTE (Synthetic Minority Over-sampling
Technique).

Code:

import pandas as pd
from sklearn.utils import resample

# Sample financial transactions data
data = {
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'amount': [100, 200, 150, 300, 400, 500, 600, 700, 800, 900],
    'merchant': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'is_fraud': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # assuming the 5th transaction is fraudulent
}

# Create DataFrame
df = pd.DataFrame(data)

# Separate majority and minority classes
df_majority = df[df['is_fraud'] == 0]
df_minority = df[df['is_fraud'] == 1]

# Upsample the minority class with replacement to match the majority class size
df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority), random_state=42)

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display class distribution after resampling
print("Class distribution after resampling:")
print(df_upsampled['is_fraud'].value_counts())

Output
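The section names SMOTE as a more advanced option. A minimal sketch using the imbalanced-learn package (an assumption, since it is not used elsewhere in this document); SMOTE interpolates between minority-class neighbors, so it needs several fraud examples and numeric features, which is why this sketch uses a slightly larger toy dataset:

import numpy as np
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Toy numeric data: 20 legitimate and 4 fraudulent transaction amounts
rng = np.random.default_rng(42)
X = np.concatenate([rng.normal(300, 100, (20, 1)), rng.normal(900, 50, (4, 1))])
y = np.array([0] * 20 + [1] * 4)

# k_neighbors must be smaller than the number of minority samples
smote = SMOTE(random_state=42, k_neighbors=3)
X_res, y_res = smote.fit_resample(X, y)
print(np.bincount(y_res))  # both classes now have 20 samples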

7. Data Splitting: Splitting the data into training, validation, and test sets to
evaluate model performance effectively. The sample code below performs a simple
train/test split; a three-way split is sketched after its output.

Code:

import pandas as pd
from sklearn.model_selection import train_test_split

# Sample financial transactions data
data = {
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'amount': [100, 200, 150, 300, 400, 500, 600, 700, 800, 900],
    'merchant': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'is_fraud': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # assuming the 5th transaction is fraudulent
}

# Create DataFrame
df = pd.DataFrame(data)

# Separate features and target variable
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Display the shape of training and test sets
print("Shape of training set:", X_train.shape)
print("Shape of test set:", X_test.shape)

Output
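To obtain the validation set named above, a common pattern is two successive splits. A minimal sketch of a 60/20/20 split (the ratio is an illustrative choice; passing stratify=y would additionally preserve the fraud ratio in each split, but requires at least two fraud cases):

# First split off the test set, then carve a validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25,
                                                  random_state=42)  # 0.25 of 80% = 20% overall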

Conclusion

Data wrangling techniques are vital for preparing raw data for analysis and
modeling. From cleaning data to handling imbalanced datasets, each step ensures
data integrity and model accuracy. Feature engineering uncovers valuable patterns,
while outlier detection prevents skewed results. Data transformation normalizes
features, enhancing model performance, and splitting data aids in robust
evaluation. Overall, data wrangling lays the foundation for effective fraud
detection, enabling accurate identification and prevention of fraudulent activities in
financial transactions.

Code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Sample financial transactions data
data = {
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'amount': [100, 200, 150, 300, 400, 500, 600, 700, 800, 900],
    'merchant': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'timestamp': ['2024-05-01 08:00:00', '2024-05-01 09:00:00', '2024-05-01 10:00:00',
                  '2024-05-01 11:00:00', '2024-05-01 12:00:00', '2024-05-01 13:00:00',
                  '2024-05-01 14:00:00', '2024-05-01 15:00:00', '2024-05-01 16:00:00',
                  '2024-05-01 17:00:00'],
    'is_fraud': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # assuming the 5th transaction is fraudulent
}

# Create DataFrame
df = pd.DataFrame(data)

# Data cleaning
df_cleaned = df.drop_duplicates().dropna().reset_index(drop=True)

# Feature engineering
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'])
df_cleaned['hour_of_day'] = df_cleaned['timestamp'].dt.hour
df_cleaned['day_of_week'] = df_cleaned['timestamp'].dt.dayofweek
df_cleaned['transaction_count_merchant'] = df_cleaned.groupby('merchant')['transaction_id'].transform('count')
df_cleaned['total_transaction_amount'] = df_cleaned.groupby('merchant')['amount'].transform('sum')

# Handling imbalanced data: upsample the minority (fraudulent) class
fraudulent = df_cleaned[df_cleaned['is_fraud'] == 1]
non_fraudulent = df_cleaned[df_cleaned['is_fraud'] == 0]
fraudulent_upsampled = resample(fraudulent, replace=True,
                                n_samples=len(non_fraudulent), random_state=42)
df_balanced = pd.concat([non_fraudulent, fraudulent_upsampled])

# Data transformation: min-max normalization of 'amount'
df_balanced['normalized_amount'] = (df_balanced['amount'] - df_balanced['amount'].min()) / \
    (df_balanced['amount'].max() - df_balanced['amount'].min())

# Data splitting
X = df_balanced.drop('is_fraud', axis=1)
y = df_balanced['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Output
print("Cleaned and Engineered Data:")
print(df_balanced)

Output
