Pdsreport
1. Introduction
With the rapid growth of online transactions and digital payments, fraud detection has become a critical
issue for financial institutions and businesses. Fraudulent transactions not only cause financial losses but
also damage an organization's reputation. This project focuses on developing a machine learning-based
fraud detection system using various algorithms to classify transactions as legitimate or fraudulent.
The goal was to build models capable of identifying fraudulent activities from transactional datasets while
minimizing false positives. This report covers the dataset used, data preprocessing steps, feature
engineering, machine learning models applied, evaluation metrics, and performance analysis.
2. Problem Statement
The project aims to detect fraudulent financial transactions using machine learning models. The challenge
is to accurately classify transactions as legitimate or fraudulent, ensuring high recall (to catch fraud cases)
and high precision (to reduce false alarms).
Objectives:
Build and compare machine learning models that classify transactions as legitimate or fraudulent.
Handle the severe class imbalance between legitimate and fraudulent transactions.
Maximize recall on fraud cases while keeping false positives low.
Select, tune, and evaluate the best-performing model for potential deployment.
3. Dataset Description
The dataset used in this project consists of historical transactional data, including various features related
to transaction details. This dataset is highly imbalanced, with fraudulent transactions being only a small
fraction of the total data.
Columns:
Account Balance (Before and After): Account balance before and after the transaction.
Fraud Label: Indicates whether the transaction is fraudulent (1) or legitimate (0).
Merchant ID: Identifier for the merchant where the transaction took place.
Customer ID: Identifier for the customer involved in the transaction.
Device ID: Identifier for the device used to make the transaction.
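As a rough illustration of this dataset description, the short sketch below loads the data with pandas and checks how rare the fraud class is. The file name transactions.csv and the exact column label "Fraud Label" are assumptions made for this sketch, not details taken from the project.

import pandas as pd

# Load the historical transaction data (file name is an assumption).
df = pd.read_csv("transactions.csv")

# List the available columns described above.
print(df.columns.tolist())

# Fraudulent transactions are only a small fraction of the data, so it is
# worth inspecting the class distribution before modelling.
print(df["Fraud Label"].value_counts(normalize=True))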
4. Data Preprocessing
Given the complexity of transactional data, several preprocessing steps were applied to clean and
transform the dataset for machine learning:
1. Feature Scaling:
Numerical features such as transaction amounts and balances were normalized using Min-Max
scaling so that features on larger scales do not dominate model training (see the sketch after this list).
2. Data Splitting:
The dataset was split into training and testing sets using an 80:20 ratio to evaluate the models
on unseen data.
3. Data Balancing:
Since the dataset was highly imbalanced, SMOTE (Synthetic Minority Oversampling
Technique) was applied to balance the classes and ensure fair model training.
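The sketch below shows one way these preprocessing steps could be wired together with scikit-learn and imbalanced-learn, continuing from the dataframe loaded in the earlier sketch. The stratified split, the random seeds, and applying SMOTE only to the training split are assumptions for illustration rather than details taken from the project code.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

# Numerical features and the fraud label (column names are assumptions).
X = df.drop(columns=["Fraud Label"]).select_dtypes(include="number")
y = df["Fraud Label"]

# 80:20 train/test split; stratification keeps the fraud ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Min-Max scaling: fit on the training data only, then apply to both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SMOTE oversampling of the minority (fraud) class in the training data.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train_scaled, y_train)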
5. Feature Selection
Feature selection was performed using the Boruta algorithm, which identifies the most important features
for the classification task by iteratively removing features that perform no better than random 'shadow'
copies of the real features. A minimal code sketch of this step follows the feature list below.
The following columns were selected using Boruta for the final model:
step
oldbalance_org
newbalance_orig
newbalance_dest
diff_new_old_balance
diff_new_old_destiny
type_TRANSFER
These selected features were used in training the machine learning models, as they were determined to
have the highest relevance to detecting fraudulent transactions.
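A minimal sketch of this selection step, using the BorutaPy implementation with a random forest estimator, is shown below. The estimator settings and the reuse of the balanced training data from the preprocessing sketch are assumptions for illustration, not the project's exact configuration.

import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# Random forest used by Boruta to score feature importance.
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)

# Boruta compares each real feature against randomly permuted 'shadow' copies.
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(np.asarray(X_train_bal), np.asarray(y_train_bal))

# Columns confirmed as important (e.g. step, oldbalance_org, ...).
selected = [col for col, keep in zip(X.columns, selector.support_) if keep]
print(selected)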
6. Machine Learning Models
Multiple machine learning models were implemented, evaluated, and compared based on their
performance metrics. Each model is summarized below with its fraud-detection performance; a
comparison sketch in code follows the list.
1. Dummy Classifier: A baseline model that performed poorly, with a balanced accuracy of 0.5 and
no predictive power (precision, recall, F1, and Kappa all at 0.0).
2. Logistic Regression: Showed perfect precision (1.0) but low recall (0.129); the transactions it
flagged as fraud were genuinely fraudulent, but it missed the large majority of fraud cases.
3. LightGBM: Achieved a balanced accuracy of 0.681 but low precision (0.27) and recall (0.364),
making it less effective for fraud detection.
4. Support Vector Machine (SVM): High precision (1.0) but low recall (0.192), similar to logistic
regression, failing to detect many fraudulent transactions.
5. K-Nearest Neighbors (KNN): Showed strong precision (0.943) but only moderate recall (0.411);
its fraud alerts were usually correct, yet it still missed more than half of the fraud cases.
6. Random Forest: Performed well with high balanced accuracy (0.861), precision (0.969), and
recall (0.721), making it suitable for fraud detection.
7. XGBoost: The best-performing model with the highest balanced accuracy (0.887), precision
(0.938), and recall (0.775), making it the most effective for detecting fraud.
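The comparison above could be reproduced along the lines of the sketch below, which trains each listed classifier on the balanced training data from the preprocessing sketch and scores it on the held-out test split with the metrics quoted in the list. All hyperparameters shown are library defaults or assumptions, not the project's exact settings.

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, f1_score, cohen_kappa_score)

models = {
    "Dummy": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LightGBM": LGBMClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

for name, model in models.items():
    model.fit(X_train_bal, y_train_bal)
    pred = model.predict(X_test_scaled)
    print(f"{name}: "
          f"bal_acc={balanced_accuracy_score(y_test, pred):.3f}, "
          f"precision={precision_score(y_test, pred, zero_division=0):.3f}, "
          f"recall={recall_score(y_test, pred):.3f}, "
          f"f1={f1_score(y_test, pred):.3f}, "
          f"kappa={cohen_kappa_score(y_test, pred):.3f}")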
7. Hyperparameter Tuning
Hyperparameter tuning was performed to optimize the XGBoost model by adjusting parameters such as
the learning rate, maximum depth, and number of estimators. This tuning significantly improved the
model's performance.
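One plausible way to carry out this tuning is a randomized search over the named parameters, sketched below. The parameter grid, number of iterations, cross-validation setup, and scoring choice are assumptions, since the report does not specify the exact search procedure.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import balanced_accuracy_score
from xgboost import XGBClassifier

# Search space over the parameters mentioned above (values are assumptions).
param_dist = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300, 500],
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="balanced_accuracy",
    cv=3,
    random_state=42,
)
search.fit(X_train_bal, y_train_bal)
print(search.best_params_)

# Evaluate the tuned model on the held-out test split.
best_xgb = search.best_estimator_
pred = best_xgb.predict(X_test_scaled)
print(balanced_accuracy_score(y_test, pred))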
8. Final Model Evaluation
After training the model with the selected hyperparameters, we evaluated its performance on the unseen
test data. The final model's predictions were compared against the actual labels from the test set, and the
following performance metrics were obtained:
Balanced Accuracy: 0.912
Precision: 0.957
Recall: 0.823
F1-Score: 0.885
Kappa: 0.885
These results indicate that the tuned model generalizes well to unseen data. The high precision (0.957)
means that almost all transactions flagged as fraud are genuinely fraudulent, while the recall of 0.823
shows that most, though not all, fraud cases are detected, leaving some room for improvement. The
F1-score and Kappa both indicate strong agreement between predicted and actual labels, demonstrating
good overall model performance and robustness.
9. Conclusion
This project successfully developed and evaluated various machine learning models for detecting
fraudulent transactions. After extensive testing and evaluation, XGBoost emerged as the best model due
to its high balanced accuracy, precision, and recall, making it suitable for deployment in a real-world
fraud detection system.
Future improvements could include real-time prediction capabilities, integration with a live payment
processing system, and continuous model retraining to handle evolving fraud patterns.