0% found this document useful (0 votes)
14 views9 pages

Financial Fraud Detection Using Machine Learning

Uploaded by

gixali5216
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views9 pages

Financial Fraud Detection Using Machine Learning

Uploaded by

gixali5216
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 9

Financial Fraud Detection using Machine Learning

Abstract−¿Recent research demonstrates that machine learning techniques can


be highly effective for detecting fraud in payment systems. These ML-based approaches
have the capacity to adapt and identify previously unseen fraud patterns. In this paper,
we apply multiple ML techniques, including logistic regression, support vector
machines, K-Nearest Neighbors and Random Forest, to detect payment fraud using a
labelled dataset of transaction records. Our results show that the proposed approaches
achieve high accuracy in identifying fraudulent transactions while maintaining a
reasonably low rate of false positives.
I. INTRODUCTION
The global shift toward digital payment systems has led to significant growth in
transaction volumes, with companies like PayPal Inc. processing $143 billion USD in a
single quarter in 2018 [4]. However, this increase in digital transactions also correlates with a
rise in financial fraud.
Effective fraud detection systems need to identify fraudulent transactions with high
accuracy while minimizing false positives to avoid inconveniencing genuine users. Excessive
false positives can negatively impact customer experience, potentially leading users to take
their business elsewhere. The challenge in fraud detection lies in the highly imbalanced
datasets, where fraudulent transactions represent a small fraction of the data.
In this paper, we evaluate the performance of multiple binary classification models—
Logistic Regression, Linear SVM, and SVM with an RBF kernel—on a labelled dataset of
payment transactions. Additionally, we compare these models with K-Nearest Neighbors
(KNN) and Random Forest to assess their effectiveness in distinguishing fraudulent
transactions from legitimate ones. Our goal is to develop classifiers that achieve high fraud
detection accuracy while maintaining a low false-positive rate.
II. RELEVANT RESEARCH
Various ML and non-ML approaches have been applied to address the challenge of
fraud detection in payment systems. Paper [1] provides a comprehensive review and
comparison of multiple state-of-the-art techniques, datasets, and evaluation criteria for this
problem, examining both supervised and unsupervised ML approaches, including Artificial
Neural Networks (ANN), Support Vector Machines (SVM), Hidden Markov Models (HMM),
and clustering methods. Paper [5] introduces a rule-based approach to tackle fraud detection,
while paper [3] addresses the issue of imbalanced data, which can lead to high false positive
rates, and proposes methods to mitigate this effect. In [2], the authors present an SVM-based
technique for detecting metamorphic malware, discussing strategies to manage the
imbalanced nature of such datasets, where malware samples are significantly fewer than
benign files, to achieve high detection precision and accuracy.
III. DATASET AND ANALYSIS
For this project, we use a Kaggle dataset [8] of simulated mobile-based payment
transactions. We begin by categorizing the data based on the different transaction types it
includes. Additionally, we apply Principal Component Analysis (PCA) to visualize data
variability in a two-dimensional space. The dataset consists of five labeled transaction
categories: 'CASH IN,' 'CASH OUT,' 'DEBIT,' 'TRANSFER,' and 'PAYMENT.' Details of
each category are outlined in Table I.
Transaction Type Non-fraud Fraud transactions Total
transactions Total
CASH IN 1399284 0 1399284
CASH OUT 2233384 4116 2237500
TRANSFER 528812 4097 532909
DEBIT 41432 0 41432
PAYMENT 2151494 0 2151494
TOTAL 6354407 8213 6362620

TABLE I: Paysim dataset statistics

Paysim dataset consists of both numerical and categorical features like transaction
type, amount transferred, account numbers of sender and recipient accounts. In our
experiments we use the following features to train our models.
1) Transaction type
2) Transaction amount
3) Sender account balance before transaction
4) Sender account balance after transaction
5) Recipient account balance before transaction
6) Recipient account balance after transaction
IV. METHOD
Our goal is to separate fraud and non-fraud transactions by obtaining a decision
boundary in the feature space defined by input transactions. Each transaction can be
represented as vector of its feature values. We have built binary classifiers using Logistic
regression, linear SVM and SVM with RBF kernels for TRANSFER and CASH OUT sets
respectively.
A. Logistic Regression
Logistic regression is a technique used to find a linear decision boundary for a binary
classifier. For a given input feature vector x, a logistic regression model with parameter θ
1
classifies the input x using the following hypothesis hθ ( x )=g(θ ¿¿ T x)= −θ T x
¿ where g is
1+e
known as Sigmoid function. For a binary classification problem, the output hθ ( x ) can be
interpreted as a probability of x as belonging to class 1. The logistic loss function with
respect to parameters θ can be given as
m
1
J (θ)= ∑ log ⁡¿ ¿
m i=1

B. Support Vector Machine


Support vector machine creates a classification hyper plane in the space defined by
input feature vectors. The training process aims to determine a hyper-plane that maximizes
geometric margin with respect to labelled input data. SVMs optimization problem can be
characterized by
n
1
min γ , ω ,b ¿∨ω∨¿ 2+C ∑ ϵ i
2 k=0

s . t . y i ( ωT x (i )+b ) ≥1−ϵ i , i=1 , … , m


ϵ i≥0 ,i=1 , … , m

In this project we use two variants of SVM - linear SVM and SVM based on RBF
kernel. An SVM based on RBF kernel can find a non-linear decision boundary in input space.
The RBF kernel function on two vectors x and z in the input space can be defined as
2
−¿∨ x−z∨¿
K ( x , z)=exp ⁡( 2
)

C. Class weights-based approach
We assign different weights to samples belonging fraud and non-fraud classes for
each of the three techniques respectively. Such an approach has been used to counter data
imbalance problem - with only 0.13 percent fraud transactions available to us. In a payments
fraud detection system, it is more critical to catch potential fraud transactions than to ensure
all non-fraud transactions are executed smoothly. In our proposed approaches, we penalize
mistakes made in misclassifying fraud samples more than misclassifying nonfraud samples.
We trained our models (for each technique) using higher class weights for fraud samples
compared to non-fraud samples.
We fine tune our models by choosing class weights to obtain desired/optimal balance
between precision and recall scores on our fraud class of samples. We have chosen class
weights such that we do not have more than around 1 percent of false positives on CV set.
This design trade off enables us to establish a balance between detecting fraud transactions
with high accuracy and preventing large number of false positives. A false positive, in our
case, is when a non-fraud transaction is misclassified as a fraudulent one.
D. K-Nearest Neighbors
KNN classifies a transaction based on the majority class among its nearest neighbors.
While intuitive, KNN can be highly sensitive to the imbalance, often resulting in higher false
positives for the minority (fraud) class. In our experiments, we set the optimal number of
neighbors based on cross-validation.

√∑ (
n
p=( p 1 , p2 , … p n )∧q=( q 1 ,q 2 ,… qn ) ∈n−dimentional space D ( p , q )= pi −qi )
2

i=1

D(x test
, x 1) , D(x test
, x 2) , … D(x
test
, x k) be the K smallest distances

^y =mode( y 1 , y 2 … , y k )

E. Random Forest
RF is an ensemble method that uses multiple decision trees. Its robustness to class
imbalance makes it an ideal candidate for fraud detection, as it can adapt class weights or use
balanced sampling. RF’s feature importance also offers insight into the significance of each
attribute in fraud detection.

^y =mode ( ^y 1 , ^y 2 … , ^y T )where ^
y t is the prediction of the t−th tree
T
1
^y =
T
∑ ^y t
t=1

where ^
y t is the prediction of the t−th tree

V. EXPERIMENTS
In this section, we describe our dataset split strategy and training, validation and
testing processes that we have implemented in this work. All software was developed using
Scikit-learn [7] ML library.
A. Dataset split strategy
We divided our dataset based on different transaction types described in the dataset
section. In particular, we use TRANSFER and CASH OUT transactions for our experiments
since they contain fraud transactions. For both types, we divided respective datasets into three
splits - 70 percent for training, 15 percent for CV and 15 percent for testing purposes. We use
stratified sampling in creating train/CV/test splits. Stratified sampling allows us to maintain
the same proportion of each class in a split as in the original dataset. Details of the splits are
mentioned in tables II and III.

TRANSFER
Split Fraud Non-Fraud Total
Train 2868 370168 373036
CV 614 79322 79936
Test 615 79322 79937
Total 4097 528812 532909

TABLE II: Dataset split details

B. Model training and tuning


We employed a class weight-based approach as described in the previous section to
train each of our models. Each model was trained multiple times using increasing weights for
fraud class samples. At the end of each iteration, we evaluated our trained models by

gave us the highest recall on fraud class with not more than ∼ 1 percent false positives.
measuring their performance on CV split. For each model, we chose the class weights which

Finally, we used the models trained using a chosen set of class weights to make
predictions on our test dataset split. In the next section, we elaborate on our choice of class
weights based on their performance on the CV set. We also discuss their performance on train
and test sets.

VI. RESULTS AND DISCUSSION


In this section, we discuss results obtained in training, validation and testing phases.
We evaluated performance of our models by computing metrics like recall, precision, f1
score.
A. Class weights selection
In our experiments, we used increasing weights for fraud samples. We initially
considered making class weights equal to imbalance ratio in our dataset. This approach
seemed to give good recall but also resulted in very high number of false positives - >> 1
percent - especially for CASH OUT. Hence, we did not use this approach and instead tuned
our models by trying out multiple combinations of weights on our CV split.
Overall, we observed that higher class weights gave us higher recall at the cost of
lower precision on our CV split.
CASH OUT
Split Fraud Non-Fraud Total
Train 2881 1563369 1566250
CV 618 335007 335625
Test 617 335008 335625
Total 4116 2233384 237500

TABLE III: Dataset split details

For TRANSFER dataset, the effect of increasing weights is less prominent, in


particular for Logistic Regression and Linear SVM algorithms. That is, equal class weights
for fraud and non-fraud samples give us high recall and precision scores. Based on these
results, we still chose higher weights for fraud samples to avoid over-fitting on CV set.

TRANSFER TRANSFER
Algorithm Recall Precision F1-measure Algorithm Recall Precision F1-measure
Logistic Regression 0.9983 0.4416 0.6123 Logistic Regression 0.9958 0.4452 0.6153
Linear SVM 0.9983 0.4432 0.6139 Linear SVM 0.9958 0.4431 0.6133
SVM with RBF kernel 0.9934 0.5871 0.7381 SVM with RBF kernel 0.9958 0.6035 0.7515
KNN 0.9267 0.9618 0.9439 KNN 0.9328 0.9732 0.9526
Random Forest 0.9973 0.9995 0.9984 Random Forest 0.9975 1.0000 0.9987
CASH OUT CASH OUT
Algorithm Recall Precision F1-measure Algorithm Recall Precision F1-measure
Logistic Regression 0.9822 0.1561 0.7235 Logistic Regression 0.9847 0.1514 0.2665
Linear SVM 0.9352 0.1263 0.6727 Linear SVM 0.9361 0.1225 0.2119
SVM with RBF kernel 0.9773 0.1315 0.7598 SVM with RBF kernel 0.9875 0.1355 0.2383
KNN 0.5778 0.8758 0.6958 KNN 0.5674 0.8895 0.6928
Random Forest 0.9939 0.9963 0.9951 Random Forest 0.9927 0.9987 0.9957
TABLE IV: Results on CV set TABLE V: Results on Train set

B. Results on train and test sets


In this section, we discuss results on train and test sets using chosen class weights.

set. We get very high recall for TRANSFER transactions with ∼ 0.99 recall score for all five
Table V summarize the results on train set. Similarly, table VI summarize the results on test

algorithms. Table VII displays corresponding confusion matrices obtained on test set of
TRANSFER. We are able to detect more than 600 fraud transactions for all five algorithms
with less than 1 percent false positives. TRANSFER transactions had shown a high
variability across their two principal components when we performed PCA on it. This set of
transactions seemed to be linearly separable - with all five of our proposed algorithms
expected to perform well on it. We can see this is indeed the case.
For CASH OUT transactions, we obtain fewer promising results compared to
TRANSFER for both train and test sets. Logistic regression and linear SVM have similar
performance. SVM with RBF gives a higher recall but with lower precision on average for
this set of transactions. A possible reason for this outcome could be non-linear decision
boundary computed using RBF kernel function. However, for all five algorithms, we can
obtain high recall scores if we are more tolerant to false positives. In the real world, this is
purely a design/business decision and depends on how many false positives is a payments
company willing to tolerate.
TRANSFER
Algorithm Recall Precision F1-measure
Logistic Regression 0.9951 0.4444 0.6144
Linear SVM 0.9951 0.4516 0.6213
SVM with RBF kernel 0.9886 0.5823 0.8949
KNN 0.9328 0.9732 0.9526
Random Forest 0.9975 1.0000 0.9987
CASH OUT
Algorithm Recall Precision F1-measure
Logistic Regression 0.9886 0.1521 0.2636
Linear SVM 0.9411 0.1246 0.6893
SVM with RBF kernel 0.9789 0.1321 0.7271
KNN 0.5674 0.8895 0.6928
Random Forest 0.9927 0.9987 0.9957

TABLE VI: Results on Test set

Overall, we observe that all our proposed approaches seem to detect fraud
transactions with high accuracy and low false positives - especially for TRANSFER
transactions. With more tolerance to false positives, we can see that it can perform well on
CASH OUT transactions as well.

TABLE VII: Confusion matrices

(a) Logistic Regression (b) Linear SVM


Pred Pred
- + - +
True - 78557 765 True - 78579 743
+ 3 612 + 3 612

(c) SVM with RBF kernel (d) K-Nearest Neighbors


Pred Pred
- + - +
True - 78886 436 True - 1270877 54
+ 7 608 + 436 1211

(e) Random Forest


Pred
- +
True - 1270877 0
+ 5 1642

VII. CONCLUSION AND FUTURE WORK


This study has demonstrated that machine learning techniques, particularly Logistic
Regression, SVM (linear and RBF kernel), Random Forest (RF), and K-Nearest Neighbors
(KNN), are effective tools for financial fraud detection. Among these, Logistic Regression
and SVM models excel at handling highly imbalanced datasets, achieving high recall with
minimal false positives, particularly for "TRANSFER" transaction types. Random Forest
offers a robust alternative due to its adaptability to class imbalance and interpretability, while
KNN showed limited effectiveness due to its sensitivity to minority class imbalance.
Our findings highlight the significance of model selection based on specific fraud detection
requirements, balancing recall and precision to minimize the inconvenience to genuine users.
The study underscores the challenges posed by imbalanced data and the critical need for
optimized class weights and tuning to achieve desired outcomes.
Future research can explore ensemble learning approaches or integrate time-series
methodologies to enhance fraud detection performance further. Such advancements could
improve the ability to identify evolving fraud patterns, ensuring more secure payment
systems.

APPENDIX
GitHub code link - https://github.com/dhhanushvanama/Fraud-Detection
VIII. ACKNOWLEDGEMENT
We would like to extend my sincere gratitude to Professor Thatiparthy Bharath Kumar
and the entire teaching staff for delivering a remarkably well-organized and expertly taught
class. Your dedication and clear instruction have made this learning experience truly
enriching. Thank you for your hard work and commitment.
REFERENCES
[1] A Survey of Credit Card Fraud Detection Techniques: Data and Technique Oriented
Perspective - Samaneh Sorournejad, Zojah, Atani et.al - November 2016
[2] Support Vector machines and malware detection - T.Singh,F.Di Troia, C.Vissagio , Mark
Stamp - San Jose State University - October 2015
[3] Solving the False positives problem in fraud prediction using automated feature
engineering - Wedge, Canter, Rubio et.al - October 2017
[4] PayPal Inc. Quarterly results https://www.paypal.com/stories/us/paypalreports-third-
quarter-2018-results
[5] A Model for Rule Based Fraud Detection in Telecommunications - Rajani,
Padmavathamma - IJERT - 2012
[6] HTTP Attack detection using n−gram analysis - A. Oza, R.Low, M.Stamp - Computers
and Security Journal - September 2014
[7] Scikit learn - machine learning library http://scikit-learn.org
[8] Paysim - Synthetic Financial Datasets For Fraud Detection https://www.kaggle.com/ntnu-
testimon/paysim1
[9] Awoyemi, John O., Adebayo O. Adetunmbi, and Samuel A. Oluwadare. "Credit card fraud
detection using machine learning techniques: A comparative analysis." 2017 international
conference on computing networking and informatics (ICCNI). IEEE, 2017.
[10] Raghavan, Pradheepan, and Neamat El Gayar. "Fraud detection using machine learning
and deep learning." 2019 international conference on computational intelligence and
knowledge economy (ICCIKE). IEEE, 2019.

You might also like