Report 16 TH
By
Bachelor of Technology
in
Information Technology
DECLARATION
We hereby declare that this submission is our own work and that, to the best of our
knowledge and belief, it contains no material previously published or written by another
person nor material which to a substantial extent has been accepted for the award of
any other degree or diploma of the university or other institute of higher learning, except
where due acknowledgment has been made in the text.
Date :
Date :
Date :
Date :
CERTIFICATE
This is to certify that the Project Report entitled “Fraud Detection In Financial Sector”, which is
submitted by Anupam, Akash, Akhil, and Mayank in partial fulfillment of the requirements for the
award of the degree of B. Tech. in the Department of Information Technology of Dr APJ Abdul Kalam
Technical University, is a record of the candidates' own work carried out by them under my/our
supervision. The matter embodied in this thesis is original and has not been submitted for the
award of any other degree.
ACKNOWLEDGEMENT
It gives us great pleasure to present the report of the B. Tech. Project undertaken during the B.
Tech. Final Year. We owe a special debt of gratitude to Professor Vaishali Tyagi, Department of
Information Technology, JSS Academy Of Technical Education, Noida, for her constant support and
guidance throughout the course of our work. Her sincerity, thoroughness and perseverance have been a
constant source of inspiration for us. It is only through her cognizant efforts that our endeavors have
seen the light of day.
We also take the opportunity to acknowledge the contribution of Professor Dhiraj Pandey, Head,
Department of Information Technology, JSS Academy Of Technical Education, Noida, for his full
support and assistance during the development of the project.
We would also like to acknowledge the contribution of all faculty members of
the department for their kind assistance and cooperation during the development of our project. Last
but not least, we acknowledge our friends for their contribution to the completion of the project.
Date :
Date :
Date :
Date :
ABSTRACT
Payment-related fraud is a key concern for cyber-crime agencies, and recent research has shown
that machine learning techniques can be applied successfully to detect fraudulent transactions
in large amounts of payments data. Such techniques can detect fraudulent transactions that
human auditors may not be able to catch, and can do so in real time.
In this project, we apply multiple supervised machine learning techniques to the problem of
fraud detection using a publicly available simulated payment transactions dataset. We aim to
demonstrate how supervised ML techniques can be used to accurately classify data with high
class imbalance.
We demonstrate that exploratory analysis can be used to separate fraudulent and
non-fraudulent transactions. We also demonstrate that for a well-separated dataset, tree-based
algorithms like Random Forest work much better than Logistic Regression.
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF ABBREVIATIONS
CHAPTER 1
1.1.
1.2.
CHAPTER 2
2.1.
2.2.
2.2.1.
2.2.2.
2.2.2.1.
2.2.2.2.
2.3.
CHAPTER 3
3.1.
3.2.
CHAPTER 4 (CONCLUSIONS)
APPENDIX A
APPENDIX B
REFERENCES
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
≠ Not Equal
∈ Belongs to
€ Euro (a currency)
_ Optical distance
(Example)
LIST OF ABBREVIATIONS
CHAPTER 1
1.1 Introduction
Digital payments of various forms are rapidly increasing across the world. Payments companies
are experiencing rapid growth in their transaction volumes. For example, PayPal processed
~$578 billion in total payments in 2018. Along with this transformation, financial fraud in
these payment systems is also increasing rapidly.
Preventing online financial fraud is a vital part of the work done by cyber security and
cyber-crime teams. Most banks and financial institutions have dedicated teams of dozens of analysts
building automated systems to analyze transactions taking place through their products and
flag potentially fraudulent ones. Therefore, it is essential to explore approaches to
detecting fraudulent entries/transactions in large amounts of data in order to be
better prepared to solve cyber-crime cases.
1.1.1 Motivation
1.1.2 Objective
use machine learning techniques to generate alerts, recommendations, actions, or
decisions based on the fraud risk level of each entity.
Regulatory Compliance: Ensure that financial institutions are equipped with the tools
and systems required to meet regulatory obligations regarding fraud detection and
reporting.
Improved Accuracy: It will use machine learning models to classify, score, rank, or
predict the fraud risk level of each transaction, customer, account, device, location,
etc.
1.1.3 Scope
In this project, we design a protocol or model to detect fraudulent activity in
financial transactions. The system is intended to provide most of the essential
features required to distinguish fraudulent from legitimate transactions. As technology changes, it
becomes difficult to model and track the patterns of fraudulent transactions. With the rise
of machine learning, artificial intelligence and other relevant fields of information
technology, it becomes feasible to automate this process and to save some of the intensive
labor that goes into detecting credit card fraud.
Phase 1: Business Understanding
As stated before, credit card fraud is increasing drastically every year. Many people have
their credit breached by fraudsters, which affects their daily lives, since paying with a
credit card is similar to taking a loan. If the problem is not solved, many people will be
left with large debts that they cannot pay back, which will make their lives hard and leave
them unable to afford necessary products; in the long run, failure to repay might even lead
to jail. In short, the problem addressed is the detection of fraudulent credit card
transactions, in order to stop those breaches and to ensure customers' security.
Phase 2: Data Understanding
In the data understanding phase, it was critical to obtain a high-quality dataset, since the
model is based on it. The dataset was explored closely to gain the knowledge needed to
confirm its quality, in addition to reading the description of the whole dataset and of each
attribute. It is also important to have a dataset that contains a mix of transaction types
(fraudulent and real), a class attribute that states the type of each transaction, and
identifiers that clarify the reason behind the classification of the transaction type. We
made sure to follow all of those points during the search for the most suitable dataset.
Phase 3: Data Preparation
After the most suitable dataset is chosen, the preparation phase begins. Preparing the
dataset includes selecting the wanted attributes or variables; cleaning it by excluding null
rows, deleting duplicated variables, and treating outliers if necessary; and transforming
data types to the wanted type. Data merging, where two or more attributes are merged, can be
performed as well. All of these alterations lead to the wanted result, which is to make the
data ready to be modelled.
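The cleaning steps described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the field names (amount, type, is_fraud) and the sample rows are hypothetical and are not taken from the project's dataset, and duplicates are interpreted here as exact repeated rows.

```python
# Hypothetical raw rows; field names and values are illustrative only.
raw_rows = [
    {"amount": "120.50", "type": "TRANSFER", "is_fraud": "0"},
    {"amount": "120.50", "type": "TRANSFER", "is_fraud": "0"},  # exact duplicate
    {"amount": None, "type": "PAYMENT", "is_fraud": "0"},       # null value
    {"amount": "9800.00", "type": "CASH_OUT", "is_fraud": "1"},
]


def prepare(rows):
    """Drop rows with nulls, drop exact duplicate rows, and cast column types."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # exclude rows that contain a null value
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exclude rows already seen
        seen.add(key)
        cleaned.append({
            "amount": float(row["amount"]),   # text -> numeric
            "type": row["type"],
            "is_fraud": int(row["is_fraud"]),  # text -> integer class label
        })
    return cleaned


prepared_rows = prepare(raw_rows)
```

On real data the same steps would more commonly be written with a dataframe library; the plain-Python version is used here only to keep the sketch self-contained.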
Phase 4: Modelling
Four machine learning models will be used in the modelling phase: KNN, SVM, Logistic
Regression and Naïve Bayes. A comparison of the results will be presented later in the paper
to determine which technique is most suited to detecting fraudulent credit card transactions.
The dataset will be split in a ratio of 80:20: the training set will be 80% of the data and
the remaining 20% will be the testing set.
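A minimal sketch of an 80:20 split is given below. It assumes a stratified split (each class is divided separately so that the rare fraud class appears in both sets); the report does not state whether stratification was used, so that choice, the function name stratified_split, and the label counts are all assumptions made for illustration.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 990 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

With a purely random split on data this imbalanced, the test set could by chance contain very few fraudulent rows, which is why stratification is a common choice for this kind of problem.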
Phase 5: Evaluation and Deployment
The final phase will evaluate the models by presenting their efficiency: the accuracies of
the models will be presented, along with any observations, to find the best and most
suitable model for detecting fraudulent credit card transactions.
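Because fraud datasets are highly imbalanced, accuracy alone can be misleading, so it is usually reported together with precision, recall and F1 for the fraud class. The sketch below is illustrative only: the label counts and both sets of predictions are invented, and the helper name classification_metrics is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical test set: 990 legitimate (0) and 10 fraudulent (1) transactions.
y_true = [0] * 990 + [1] * 10

# A baseline that never flags fraud is 99% accurate but catches nothing.
baseline = classification_metrics(y_true, [0] * 1000)

# A hypothetical detector: 4 false alarms, 8 of the 10 frauds caught.
detector = classification_metrics(y_true, [1] * 4 + [0] * 986 + [1] * 8 + [0] * 2)
```

Comparing the two results shows why recall on the fraud class matters: the baseline scores high on accuracy while detecting no fraud at all.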
Significant research and practical applications have been conducted in the field of fraud
detection within the financial sector. Key contributions include:
Machine Learning Approaches: Research exploring the use of machine learning algorithms
such as Random Forests, Neural Networks, Support Vector Machines, and Gradient Boosting
for fraud detection.
Anomaly Detection: Research into the application of anomaly detection techniques, including
clustering, outlier detection, and network analysis, to identify irregularities in financial data.
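As one concrete instance of outlier detection, the sketch below flags transaction amounts using the modified z-score based on the median absolute deviation (MAD). This is a generic textbook technique chosen for illustration, not a method the report states it used; the amounts are invented, and 3.5 is the commonly recommended cutoff for this score.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical transaction amounts with one unusually large payment.
amounts = [20.0, 35.0, 18.0, 42.0, 27.0, 31.0, 5000.0, 24.0]
flagged = mad_outliers(amounts)
```

The median-based score is used instead of the mean and standard deviation because a single extreme amount inflates the standard deviation and can mask itself.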
CHAPTER 2
To detect financial fraud, researchers typically use outlier detection techniques (Jayakumar
et al., 2013) with highly imbalanced datasets. Different types of financial fraud are also
possible. One article suggests four categories of financial fraud: financial statement fraud,
transaction fraud, insurance fraud and credit fraud (Jans et al., 2011). In this project, the focus
is on transaction fraud, specifically as it applies to mobile payments and deepfake voice fraud.
2.2.1 Deepfake Audio Detection via MFCC Features Using Machine
Learning
2.2.2 Deep fake Audio Detection: A Deep Learning Based Solution for Group
Conversations
We built Deep Neural Network models and integrated them into a single solution using
different datasets, including but not limited to UrbanSound8K (5.6GB), Conversational
(12.2GB), AMI-Corpus (5GB), and FakeOrReal (4GB).
Our proposed approach consists of four main components. The speech-denoising component
cleans and preprocesses the audio using Multilayer-Perceptron and Convolutional Neural
Network architectures, with 93% and 94% accuracy respectively.
The speaker diarization was implemented using two different approaches, Natural Language
Processing for text conversion with 93% accuracy and Recurrent Neural Network model for
speaker labeling with 80% accuracy and 0.52 Diarization-Error-Rate.
The final component distinguishes between real and fake audio using a CNN architecture
with 94% accuracy. With these findings, this research will contribute immensely to the
domain of speech analysis.
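The summary above quotes a Diarization-Error-Rate without defining it. For reference, the conventional definition is sketched below; whether the cited work computed it in exactly this way is an assumption. The symbols are introduced here for illustration: T_FA is the duration of non-speech labelled as speech, T_miss the duration of missed speech, T_conf the duration attributed to the wrong speaker, and T_total the total duration of reference speech.

```latex
\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{total}}}
```

Under this definition a value of 0.52 would mean that errors amount to 52% of the reference speech time.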
2.2.3 Detecting Deep fake Voice Using Explainable Deep Learning
Techniques
In this paper, we present a human perception level of interpretability for deepfake audio
detection.
For the general interpretation of the detection model, two datasets with exclusive
characteristics were used. The first set of experiments was conducted upon the ASVspoof
2021 Logical Access dataset. ASVspoof consists of 2580 bona fide user speech data collected
from 107 speakers and the corresponding 22,800 synthesized speech data generated using 19
synthesizers.
As mentioned earlier, visualized interpretation with XAI methods for image classification
that provides the output in the form of a heatmap is often acceptable. If the classification
accuracy is high enough, the ensuing XAI result also tends to proceed properly. However,
current XAI methods are not perfect, and often fail to separate an object from the
background, eventually only highlighting the high-contrast object contour.
2.2.4 The Effect of Deep Learning Methods on Deepfake Audio Detection
for Digital Investigation
Voice cloning methods have been used in a range of ways, from customized speech interfaces
for marketing to video games. Current voice cloning systems are smart enough to learn speech
characteristics from a few samples and produce perceptually unrecognizable speech. These
systems pose new protection and privacy risks to voice-driven interfaces. Fake audio has been
used for malicious purposes, and it is difficult to distinguish real from fake audio during a
digital forensic investigation. This paper reviews the issue of deep-fake audio classification and
evaluates the current methods of deep-fake audio detection for forensic investigation. Audio
file features were extracted and visually presented using MFCC, Mel-spectrum, Chromagram,
and spectrogram representations to further study the differences.
2.2.5 Real-Time Detection of AI-Generated Speech for DeepFake
Voice Conversion
There are growing implications surrounding generative AI in the speech domain that enable
voice cloning and real-time voice conversion from one individual to another. This technology
poses a significant ethical threat and could lead to breaches of privacy and misrepresentation,
thus there is an urgent need for real-time detection of AI-generated speech for DeepFake
Voice Conversion. To address the above emerging issues, the DEEP-VOICE dataset is
generated in this study, comprised of real human speech from eight well-known figures and
their speech converted to one another using Retrieval-based Voice Conversion. Presenting as
a binary classification problem of whether the speech is real or AI-generated, statistical
analysis of temporal audio features through t-testing reveals that there are significantly
different distributions. Hyperparameter optimisation is implemented for machine learning
models to identify the source of speech. Following the training of 208 individual machine
learning models over 10-fold cross validation, it is found that the Extreme Gradient Boosting
model can achieve an average classification accuracy of 99.3% and can classify speech in
real-time, at around 0.004 milliseconds given one second of speech. All data generated for
this study is released publicly for future research on AI speech detection.
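The 10-fold cross-validation mentioned above can be sketched as follows. This is a generic illustration in standard-library Python and not the cited study's code: k_fold_indices, cross_validate, the score_fold callback and the sample count are all hypothetical names and values.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, score_fold, k=10):
    """Average a user-supplied per-fold score over all k folds."""
    scores = [score_fold(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Hypothetical use with 25 samples: every sample is validated exactly once.
splits = list(k_fold_indices(25, k=10))
```

In practice score_fold would train a model on the training indices and return its accuracy on the validation indices; the reported figure is then the mean over the ten folds.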
System Design And Methodology
- Train the model, tune hyperparameters, and validate its performance.
3. Model Evaluation & Optimization:
- Optimize the model based on evaluation results, fine-tuning for better accuracy.
- Periodically update or retrain the model with new data for continuous accuracy.
- Ensure the solution complies with financial regulations and industry standards throughout
the process.