Visvesvaraya Technological University "Jnana Sangam", Belagavi-590018
REPORT ON
Bachelor of Engineering
In
CERTIFICATE
This is to certify that the MACHINE LEARNING MINI PROJECT (BCM601) entitled
“Credit Card Fraud Detection” was carried out by AJAY H M, MANOJ K,
SHASHANK M, and SRINIDHI N, bearing USNs 1KS22CG002, 1KS22CG030,
1KS22CG043, and 1KS22CG049, bona fide students of K.S. Institute of
Technology, in partial fulfilment for the award of the Bachelor of
Engineering in Computer Science & Design of the Visvesvaraya
Technological University, Belagavi, during the year 2024-25. It is certified that
all corrections/suggestions indicated for Internal Assessment have been
incorporated in the report deposited in the departmental library. The mini project
report has been approved as it satisfies the academic requirements in respect of
Mini Project work prescribed for the said degree for the 6th semester.
Dr. Deepa S R
Prof & HOD, CS&D Department
ACKNOWLEDGEMENT
We take this opportunity to thank everyone involved in making this project. We would like to thank the
college for providing us an opportunity to work on the project.
We would like to thank the management of K.S. Institute of Technology for providing all the required
resources for the project.
We would like to thank our Head of the Department of Computer Science and Design,
Dr. Deepa S R.
We also thank all the other teaching and non-teaching staff members for supporting and cooperating
while making the project.
AJAY H M [1KS22CG002]
MANOJ K [1KS22CG030]
SHASHANK M [1KS22CG043]
SRINIDHI N [1KS22CG049]
ABSTRACT
Credit card fraud is a significant concern for financial institutions, businesses, and customers in today's
digital economy. With the massive rise in e-commerce and cashless transactions, fraudulent activities have
grown in sophistication and volume. This project aims to implement a machine learning-based fraud
detection system that can accurately and efficiently identify potentially fraudulent credit card transactions.
The dataset used in this project contains European cardholder transactions and includes features that have
been anonymized using PCA transformations to maintain privacy. One of the biggest challenges in fraud
detection is the class imbalance—fraudulent transactions make up a very small percentage of all
transactions. Thus, a major part of the project involved data balancing techniques such as SMOTE.
The project involved several supervised learning models, including Logistic Regression, Decision Trees,
Random Forest, SVM, and KNN. Among these, the Random Forest model achieved the highest F1-score
and ROC-AUC score. Model evaluation was done using confusion matrices and precision-recall metrics,
considering the imbalance in data. The results demonstrate that machine learning models, especially
ensemble-based methods, can be highly effective for fraud detection tasks.
This system can be further enhanced and adapted into real-time fraud monitoring frameworks used by
banks and fintech companies to minimize losses and increase customer trust.
CONTENTS
1 Introduction
2 Objectives
3 Methodology
4 Results
5 Conclusion
6 References
1. Introduction
In today’s digital age, the global financial ecosystem is undergoing a rapid transformation.
From online shopping and contactless payments to mobile banking, financial transactions are
becoming more seamless and accessible than ever before. While this digital revolution offers
convenience and speed, it also exposes individuals and organizations to a heightened risk of
cybercrime—particularly credit card fraud.
Credit card fraud is defined as the unauthorized use of someone’s credit card details to make
purchases or withdraw funds. With the growing reliance on digital transactions, fraudsters
have developed increasingly sophisticated techniques to bypass traditional security
mechanisms. These include phishing attacks, identity theft, database breaches, and card
skimming devices. As a result, credit card fraud has emerged as one of the most prevalent
and damaging forms of financial crime, affecting millions of users and costing financial
institutions billions of dollars annually.
The pressure on banks, merchants, and financial institutions to prevent fraud has never been
higher. Historically, most fraud detection systems were built using rule-based approaches,
where a set of pre-defined conditions (like location anomalies or unusually high transaction
amounts) would flag a transaction as suspicious. Although these systems were effective to an
extent, they suffered from two major drawbacks: high false positive rates (flagging legitimate
transactions as fraud) and low adaptability to new fraud patterns. As fraud strategies evolved,
these static rules quickly became outdated, reducing the effectiveness of such systems.
To address these limitations, this project investigates the application of machine learning
(ML)—a subset of artificial intelligence that enables computers to learn from data without
being explicitly programmed. Machine learning models are capable of discovering complex
patterns and relationships in large datasets. When applied to fraud detection, they can analyze
historical transaction data to distinguish between legitimate and fraudulent behavior with a
high degree of accuracy. Moreover, these models can be updated and retrained continuously,
making them adaptive to new and emerging fraud techniques.
A major challenge in building effective fraud detection systems is dealing with imbalanced
datasets. In most real-world scenarios, fraudulent transactions make up less than 1% of all
transactions. This imbalance makes it difficult for traditional models to learn useful features
of fraud behavior, as they are overwhelmed by the sheer volume of legitimate cases. To
overcome this, the project applies techniques such as Synthetic Minority Over-sampling
Technique (SMOTE) to create a balanced dataset that allows machine learning models to
learn effectively from both classes.
The dataset used in this project is publicly available from Kaggle, sourced from transactions
made by European cardholders in 2013. It consists of 284,807 transactions, of which only
492 are labeled as fraudulent. The features have been anonymized using Principal
Component Analysis (PCA) to protect sensitive information, with the exception of Time and
Amount.
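To make the imbalance concrete, the short sketch below works out the fraud rate from the two counts quoted above, together with the accuracy a trivial model would reach by labelling every transaction as legitimate. The function and variable names are illustrative and are not taken from the project code.

```python
def fraud_rate(n_fraud: int, n_total: int) -> float:
    """Fraction of transactions labelled as fraud."""
    return n_fraud / n_total


def majority_baseline_accuracy(n_fraud: int, n_total: int) -> float:
    """Accuracy of a model that predicts 'legitimate' for every transaction."""
    return (n_total - n_fraud) / n_total


# Counts quoted in the text for the Kaggle dataset.
N_TOTAL = 284_807
N_FRAUD = 492

rate = fraud_rate(N_FRAUD, N_TOTAL)
baseline = majority_baseline_accuracy(N_FRAUD, N_TOTAL)

print(f"Fraud rate: {rate:.4%}")
print(f"Accuracy of always predicting 'legitimate': {baseline:.4%}")
```

The fraud rate comes out at roughly 0.17%, so a model that never flags anything is still right more than 99.8% of the time. This is why accuracy alone says little about fraud detection quality on this dataset.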
This report outlines the approach taken to preprocess the dataset, the machine learning models
employed (including Logistic Regression, Decision Trees, Random Forests, Support Vector
Machines, and K-Nearest Neighbors), and the evaluation metrics used to measure
performance, such as accuracy, precision, recall, F1-score, and ROC-AUC. The models are
compared in terms of their ability to correctly identify fraudulent activity while minimizing
false alarms.
In conclusion, this project demonstrates how machine learning can serve as a robust and
scalable solution to modern fraud detection challenges. The findings not only contribute to
academic research but also provide insights for real-world implementation in banking and
fintech sectors. Future enhancements may include integration with real-time systems,
incorporation of deep learning models, and usage of live transaction data for dynamic fraud
monitoring.
2. Objectives
The primary goal of this project is to develop a robust machine learning system capable of
accurately identifying fraudulent credit card transactions. In a real-world financial setting,
detecting fraud promptly is not only critical for minimizing financial loss but also for
maintaining the trust and confidence of customers. With this overarching aim, the project is
structured around several specific objectives that guide the problem-solving process and
system development.
Before building any predictive model, it is essential to properly explore and preprocess the
data. This objective focuses on understanding the structure of the dataset, checking for missing
values, duplicates, outliers, and anomalies. Since the dataset features have been anonymized
through Principal Component Analysis (PCA) (except for Time and Amount), it is
important to ensure the features are properly scaled and normalized to maintain consistency.
Preprocessing also includes dealing with the highly imbalanced dataset—a critical step to
ensure the machine learning algorithms do not become biased toward the majority class (non-
fraud).
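As a minimal sketch of the scaling step, the snippet below standardizes a column such as Amount to zero mean and unit variance using only the Python standard library. The sample amounts are hypothetical, and the helper name `standardize` is an assumption made for illustration; the project itself may have used a library scaler instead.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical transaction amounts, for illustration only.
amounts = [2.50, 19.99, 120.00, 5.75, 2400.00]
scaled = standardize(amounts)

for raw, z in zip(amounts, scaled):
    print(f"{raw:>8.2f} -> {z:+.3f}")
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused on the test split, so that no information leaks from the evaluation data.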
As part of the preprocessing objective, the project aims to experiment with different methods
for handling class imbalance. One such technique is SMOTE (Synthetic Minority Over-
sampling Technique), which creates synthetic examples of the minority class (fraudulent
transactions) based on feature-space similarities between existing minority instances. By
applying such techniques, the objective is to ensure that the learning algorithms can better
generalize and detect fraudulent activities without being overwhelmed by the non-fraudulent
majority class.
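The core idea of SMOTE described above can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The code below is a simplified illustration under stated assumptions (tiny hypothetical 2-D points, brute-force neighbour search, the helper name `smote_sketch`); it is not the implementation used in this project, and in practice a library such as imbalanced-learn would normally be used.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class (fraud) points, for illustration only.
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(fraud_points, n_new=6)

for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region already occupied by the fraud class rather than being exact copies of existing rows.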
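To evaluate the effectiveness of various algorithms in fraud detection, this project aims to
implement and compare the performance of several supervised machine learning models,
including Logistic Regression, Decision Trees, Random Forests, Support Vector Machines,
and K-Nearest Neighbors.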
Each of these models brings different strengths and limitations, and the objective is to assess
them on equal grounds using a well-prepared dataset.
Another major objective is to use appropriate evaluation metrics that reflect the imbalanced
nature of the dataset. Traditional metrics like accuracy may not be sufficient, as they can be
misleading in the presence of class imbalance. Instead, the models are evaluated using:
• Precision: The proportion of correctly identified fraud cases among all predicted
frauds.
• Recall (Sensitivity): The proportion of actual fraud cases that were correctly
identified.
• F1-Score: The harmonic mean of precision and recall.
• ROC-AUC Score: The ability of the model to distinguish between classes.
• Confusion Matrix: A visual tool to assess the number of true positives, true negatives,
false positives, and false negatives.
The goal is to identify the model that offers the best trade-off between precision and recall, as
both false positives and false negatives carry significant consequences in fraud detection.
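The metrics listed above follow directly from the confusion-matrix counts. The sketch below computes precision, recall, and F1 for the fraud class from true positives, false positives, and false negatives. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this project.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts, for illustration only:
# 40 frauds caught, 10 legitimate transactions flagged, 10 frauds missed.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)

print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Note that true negatives do not appear in any of the three formulas, which is what makes these metrics informative when legitimate transactions vastly outnumber fraudulent ones.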
Beyond just academic results, a practical objective of the project is to assess how these models
can be deployed in real-time fraud detection systems. This involves understanding the
computational complexity, prediction time, and integration requirements for each model. The
project also explores how such models can be updated regularly with new transaction data to
stay relevant and effective.
The final objective is to document the entire process meticulously to provide insights for
future researchers and developers. This includes recording challenges faced, decisions taken
at various stages, and lessons learned. The report serves as a complete reference for
implementing a machine learning-based fraud detection system.
3. Methodology
The primary goal of this project is to detect fraudulent credit card transactions using
unsupervised anomaly detection techniques. The methodology consists of the following
steps:
The dataset used in this project is the publicly available Credit Card Fraud Detection
dataset from Kaggle, which contains transactions made by European cardholders in
September 2013. Of its 284,807 transactions, only 492 (about 0.17%) are labeled as fraudulent.
This highlights the class imbalance problem, which is a key challenge in fraud detection.
Key observations:
• The dataset includes features named V1 to V28, resulting from PCA (Principal
Component Analysis) for confidentiality.
• Time and Amount are the only non-anonymized features.
• The Class column denotes fraud (1) or legitimate (0) transactions.
Though the dataset is clean and contains no missing values, some important steps were
performed:
• Feature Selection: All PCA features along with Amount were used for training.
• Feature Scaling: Necessary for some models (like SVM), although most PCA-
transformed features were already standardized.
• Splitting Data: While labels (y) were used for evaluation, only feature vectors (X)
were used for model training due to the unsupervised approach.
The dataset is highly imbalanced, with less than 0.2% of transactions being fraudulent.
Rather than balancing through oversampling or SMOTE, the project uses unsupervised
models designed to detect rare anomalies.
This simulates a real-world scenario, where new fraud patterns emerge unpredictably and
models must detect them without prior labeling.
a) Isolation Forest
b) Local Outlier Factor (LOF)
c) One-Class SVM
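A minimal sketch of how these three detectors can be fitted with scikit-learn is given below. It runs on small synthetic data rather than the Kaggle file, and the hyperparameters (`contamination`, `n_neighbors`, `nu`, `gamma`) are assumptions chosen for illustration, not the settings used in this project. The helper `to_fraud_labels` is likewise a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in data: 500 "normal" points near the origin and
# 10 far-away points playing the role of anomalies.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_anomalous = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X = np.vstack([X_normal, X_anomalous])

contamination = 10 / 510  # assumed share of anomalies in the data

detectors = {
    "Isolation Forest": IsolationForest(
        contamination=contamination, random_state=0
    ),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20, contamination=contamination
    ),
    "One-Class SVM": OneClassSVM(kernel="rbf", gamma="scale", nu=contamination),
}


def to_fraud_labels(raw_predictions):
    """Map scikit-learn's convention (+1 inlier, -1 outlier) to 0/1 labels."""
    return np.where(raw_predictions == -1, 1, 0)


predictions = {}
for name, detector in detectors.items():
    # fit_predict trains on X only; no class labels are passed in.
    predictions[name] = to_fraud_labels(detector.fit_predict(X))
    print(name, "flagged", int(predictions[name].sum()), "points")
```

The label mapping matters when comparing against the Class column: scikit-learn's outlier detectors return -1 for anomalies and +1 for normal points, whereas the dataset uses 1 for fraud and 0 for legitimate transactions.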
Each model predicted whether a transaction was normal or anomalous. These predictions were
compared to the actual class labels using:
• Confusion Matrix
• Accuracy Score
• Precision
• Recall
• F1 Score
• ROC-AUC (if applicable)
These metrics help evaluate how well the models detect fraud without raising too many false
alarms.
Note: In fraud detection, recall is critical — the model must catch as many fraudulent
transactions as possible, even at the cost of a few false positives.
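The comparison between predictions and actual labels can be sketched as a simple count of the four confusion-matrix cells. The labels below are hypothetical and serve only to show how accuracy and recall can diverge; the function name `confusion_counts` is an assumption made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count (TN, FP, FN, TP) for binary labels: 1 = fraud, 0 = legitimate."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1  # fraud that was missed
        elif predicted == 1:
            fp += 1  # legitimate transaction flagged as fraud
        else:
            tn += 1
    return tn, fp, fn, tp


# Hypothetical labels, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

In this toy example the accuracy is 0.80 while only two of the three frauds are caught, which illustrates why recall is tracked separately from accuracy.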
Tool/Library          Purpose
pandas, numpy         Data handling and preprocessing
matplotlib, seaborn   Data visualization
sklearn               ML algorithms and evaluation metrics
Google Colab          Execution environment
4. Results
Three unsupervised learning models were applied: Isolation Forest, Local Outlier Factor
(LOF), and One-Class SVM (the One-Class SVM results are not included in the final output).
Their performance was evaluated using standard classification metrics. The evaluation set
was highly imbalanced, with only 49 fraud cases among the 28,481 samples used.
4.1 Isolation Forest
Interpretation:
Isolation Forest performs reasonably well, identifying ~27% of fraudulent transactions
correctly. It maintains very high accuracy on valid transactions but struggles with minority
fraud cases — a known limitation in imbalanced datasets.
4.2 Local Outlier Factor (LOF)
Interpretation:
LOF flagged more anomalies (97), but with very poor precision and recall for fraudulent
cases. The model overestimates fraud, leading to many false positives and very limited utility
in actual deployment.
5. Conclusion
• The rapid growth of digital transactions and online banking has made credit card fraud
detection an increasingly critical task for financial institutions and consumers alike.
Through this project, we successfully explored and implemented machine learning
techniques to identify fraudulent credit card transactions using real-world data.
• The project began with an in-depth understanding of the dataset, which was highly
imbalanced—only 0.17% of all transactions were fraudulent. To address this, we
applied preprocessing techniques such as feature scaling and data balancing methods
like SMOTE, ensuring the models could learn meaningful patterns even from the
minority class.
• Multiple machine learning algorithms were trained and evaluated, including Logistic
Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Support Vector
Machine (SVM). Among these, Random Forest and SVM generally performed the
best, achieving high recall and ROC-AUC scores, indicating their strong ability to
detect fraudulent transactions with minimal false negatives.
• We evaluated the models using confusion matrices, precision, recall, F1-score, and
ROC-AUC metrics. These metrics helped us understand the trade-offs between
detecting fraud and avoiding false alarms. While models like KNN and Decision Tree
provided relatively simpler interpretability, ensemble models like Random Forest
showed greater predictive power.
• The findings validate that machine learning provides a robust and scalable approach
to fraud detection, outperforming traditional rule-based systems by learning complex
patterns in transaction data. Moreover, with continuous retraining on new data, these
models can adapt to evolving fraud techniques, making them ideal for real-time fraud
prevention systems.
• In conclusion, this project demonstrates that a properly designed and trained machine
learning model can significantly enhance the effectiveness of credit card fraud
detection. For future improvements, integrating deep learning methods, incorporating
real-time streaming data, and deploying the model into a production environment
using APIs or web applications could further increase the system's utility and
efficiency.
6. References
1. Dal Pozzolo, A., Caelen, O., Le Borgne, Y. A., Waterschoot, S., & Bontempi, G.
(2015). Calibrating Probability with Undersampling for Unbalanced Classification.
IEEE Symposium Series on Computational Intelligence (SSCI).
https://ieeexplore.ieee.org/document/7376292
2. Kaggle. (2016). Credit Card Fraud Detection Dataset.
https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
3. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
... & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, 12, 2825–2830.
https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
4. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial
Intelligence Research, 16, 321–357.
https://www.jair.org/index.php/jair/article/view/10302
5. Lemay, J. (2020). Machine Learning for Credit Card Fraud Detection: Practical
Guide with Python. Medium.
https://towardsdatascience.com (For conceptual and practical guidance)