Financial Fraud Detection Using Machine Learning
The PaySim dataset consists of both numerical and categorical features, such as the transaction type, the amount transferred, and the account numbers of the sender and recipient accounts. In our experiments we use the following features to train our models (a loading sketch follows the list):
1) Transaction type
2) Transaction amount
3) Sender account balance before transaction
4) Sender account balance after transaction
5) Recipient account balance before transaction
6) Recipient account balance after transaction
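As a rough illustration of how these features can be pulled from the dataset, here is a minimal sketch. The column names follow the public PaySim CSV [8]; the file path is a placeholder:

import pandas as pd

# Column names as in the public PaySim CSV [8]; "paysim.csv" is a placeholder path.
df = pd.read_csv("paysim.csv")
features = ["type", "amount", "oldbalanceOrg", "newbalanceOrig",
            "oldbalanceDest", "newbalanceDest"]
X = pd.get_dummies(df[features], columns=["type"])  # one-hot encode transaction type
y = df["isFraud"]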
IV. METHOD
Our goal is to separate fraud and non-fraud transactions by obtaining a decision boundary in the feature space defined by the input transactions. Each transaction is represented as a vector of its feature values. We have built binary classifiers using logistic regression, linear SVM, SVM with an RBF kernel, K-nearest neighbors, and random forest for the TRANSFER and CASH OUT sets respectively.
A. Logistic Regression
Logistic regression is a technique used to find a linear decision boundary for a binary classifier. For a given input feature vector x, a logistic regression model with parameters θ classifies the input using the hypothesis

$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$

where g is known as the sigmoid function. For a binary classification problem, the output $h_\theta(x)$ can be interpreted as the probability of x belonging to class 1. The logistic loss function with respect to the parameters θ is

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
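For concreteness, the hypothesis and loss above can be written directly in NumPy. This is a minimal sketch with names of our own choosing, not code from the original experiments:

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x), applied row-wise to the design matrix X
    return sigmoid(X @ theta)

def logistic_loss(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = hypothesis(theta, X)
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

theta = np.zeros(3)
X = np.array([[1.0, 0.5, -0.2], [1.0, -1.0, 0.3]])
y = np.array([1, 0])
print(logistic_loss(theta, X, y))  # log(2) ≈ 0.6931 at theta = 0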
B. Support Vector Machines
In this project we use two variants of SVM: linear SVM and SVM based on the RBF kernel. An SVM based on the RBF kernel can find a non-linear decision boundary in the input space. The RBF kernel function on two vectors x and z in the input space is defined as

$K(x, z) = \exp\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right)$
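A minimal NumPy sketch of this kernel (note that Scikit-learn's SVC parameterizes it as exp(-γ·||x − z||²), i.e. γ = 1/(2σ²)):

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

print(rbf_kernel(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # exp(-1) ≈ 0.3679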
C. Class weights-based approach
We assign different weights to samples belonging to the fraud and non-fraud classes for each of the techniques above. This approach counters the data imbalance problem: only 0.13 percent of the transactions available to us are fraudulent. In a payments fraud detection system, it is more critical to catch potential fraud transactions than to ensure that all non-fraud transactions are executed smoothly. We therefore penalize mistakes made in misclassifying fraud samples more heavily than mistakes on non-fraud samples: for each technique, we trained our models using higher class weights for fraud samples than for non-fraud samples.
We fine-tune our models by choosing class weights that achieve the desired balance between precision and recall on the fraud class. We chose class weights such that we incur no more than around 1 percent false positives on the CV set. This design trade-off lets us balance detecting fraud transactions with high accuracy against generating a large number of false positives. A false positive, in our case, is a non-fraud transaction misclassified as a fraudulent one.
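In Scikit-learn, such class weights can be passed directly to each estimator. A minimal sketch; the weight values below are hypothetical placeholders rather than the tuned values:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

# Hypothetical weights: fraud (class 1) errors cost 50x more than non-fraud errors.
# The actual values would be tuned on the CV split to cap false positives at ~1 percent.
weights = {0: 1, 1: 50}

log_reg = LogisticRegression(class_weight=weights, max_iter=1000)
lin_svm = LinearSVC(class_weight=weights)
rbf_svm = SVC(kernel="rbf", class_weight=weights)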
D. K-Nearest Neighbors
KNN classifies a transaction based on the majority class among its K nearest neighbors. While intuitive, KNN can be highly sensitive to class imbalance and often performs poorly on the minority (fraud) class. In our experiments, we choose the number of neighbors by cross-validation.
For two points $p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ in n-dimensional space, the Euclidean distance is

$D(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

Let $D(x_{\text{test}}, x_1), D(x_{\text{test}}, x_2), \ldots, D(x_{\text{test}}, x_k)$ be the K smallest distances from the test point to the training points. The predicted label is the majority vote of the corresponding labels:

$\hat{y} = \text{mode}(y_1, y_2, \ldots, y_k)$
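A minimal Scikit-learn sketch of KNN on toy data; the number of neighbors shown is illustrative, and in our experiments it is chosen by cross-validation:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 6))     # toy stand-in for the six features
y_train = rng.integers(0, 2, size=100)  # toy fraud/non-fraud labels

knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict(X_train[:3]))  # majority vote among the 5 nearest neighbors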
E. Random Forest
RF is an ensemble method that uses multiple decision trees. Its robustness to class
imbalance makes it an ideal candidate for fraud detection, as it can adapt class weights or use
balanced sampling. RF’s feature importance also offers insight into the significance of each
attribute in fraud detection.
For classification, the ensemble prediction is the majority vote over the trees:

$\hat{y} = \text{mode}(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T)$

where $\hat{y}_t$ is the prediction of the t-th tree. When averaging (e.g., predicted class probabilities), the ensemble output is

$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t$

where $\hat{y}_t$ is again the output of the t-th tree.
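A minimal Scikit-learn sketch on toy data; the hyperparameter values are illustrative, not the tuned values from our experiments:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))     # toy features
y = rng.integers(0, 2, size=200)  # toy labels

# class_weight="balanced" reweights classes inversely to their frequency,
# one way a random forest can counter the fraud/non-fraud imbalance.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)  # relative importance of each of the six features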
V. EXPERIMENTS
In this section, we describe our dataset split strategy and the training, validation, and testing processes that we implemented in this work. All software was developed using the Scikit-learn [7] ML library.
A. Dataset split strategy
We divided our dataset based on the different transaction types described in the dataset section. In particular, we use TRANSFER and CASH OUT transactions for our experiments, since they contain the fraud transactions. For both types, we divided the respective datasets into three splits: 70 percent for training, 15 percent for CV, and 15 percent for testing. We use stratified sampling in creating the train/CV/test splits, which maintains the same proportion of each class in every split as in the original dataset. Details of the splits are given in Tables II and III, and a code sketch of the split follows Table II.
TABLE II: TRANSFER split details

Split    Fraud    Non-Fraud    Total
Train    2868     370168       373036
CV       614      79322        79936
Test     615      79322        79937
Total    4097     528812       532909
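A sketch of the 70/15/15 stratified split using Scikit-learn's train_test_split, on toy data with a PaySim-like fraud ratio (all names here are ours, not from the original code):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 6))               # toy feature matrix
y = (rng.random(100_000) < 0.0013).astype(int)  # ~0.13 percent "fraud" labels

# Carve off 30 percent, then halve it, stratifying on the label each time
# so that every split keeps the original fraud ratio.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)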
We trained our models with several candidate sets of class weights, measuring their performance on the CV split. For each model, we chose the class weights which gave us the highest recall on the fraud class with not more than ∼1 percent false positives.
Finally, we used the models trained with the chosen class weights to make predictions on our test split. In the next section, we elaborate on our choice of class weights based on their performance on the CV set, and we also discuss performance on the train and test sets.
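The recall, precision, and F1 scores reported below can be computed with Scikit-learn's metrics module; a toy sketch, not our actual evaluation code:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])  # toy ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])  # toy model predictions

print(recall_score(y_true, y_pred))     # fraction of fraud actually caught
print(precision_score(y_true, y_pred))  # fraction of fraud flags that are correct
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp / (fp + tn))  # false-positive rate, capped at ~1 percent in our design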
TABLE IV: Results on CV set

TRANSFER
Algorithm             Recall   Precision   F1-measure
Logistic Regression   0.9983   0.4416      0.6123
Linear SVM            0.9983   0.4432      0.6139
SVM with RBF kernel   0.9934   0.5871      0.7381
KNN                   0.9267   0.9618      0.9439
Random Forest         0.9973   0.9995      0.9984

CASH OUT
Algorithm             Recall   Precision   F1-measure
Logistic Regression   0.9822   0.1561      0.7235
Linear SVM            0.9352   0.1263      0.6727
SVM with RBF kernel   0.9773   0.1315      0.7598
KNN                   0.5778   0.8758      0.6958
Random Forest         0.9939   0.9963      0.9951

TABLE V: Results on Train set

TRANSFER
Algorithm             Recall   Precision   F1-measure
Logistic Regression   0.9958   0.4452      0.6153
Linear SVM            0.9958   0.4431      0.6133
SVM with RBF kernel   0.9958   0.6035      0.7515
KNN                   0.9328   0.9732      0.9526
Random Forest         0.9975   1.0000      0.9987

CASH OUT
Algorithm             Recall   Precision   F1-measure
Logistic Regression   0.9847   0.1514      0.2665
Linear SVM            0.9361   0.1225      0.2119
SVM with RBF kernel   0.9875   0.1355      0.2383
KNN                   0.5674   0.8895      0.6928
Random Forest         0.9927   0.9987      0.9957
Table V summarizes the results on the train set. Similarly, Table VI summarizes the results on the test set. We get very high recall for TRANSFER transactions, with recall scores of ∼0.99 for four of the five algorithms and ∼0.93 for KNN. Table VII displays the corresponding confusion matrices obtained on the TRANSFER test set. We are able to detect the large majority of the 615 fraud transactions in the test set with each of the five algorithms, while keeping false positives below 1 percent. TRANSFER transactions had shown high variability across their first two principal components when we performed PCA on them. This set of transactions appeared close to linearly separable, and all five of our proposed algorithms were expected to perform well on it. We can see that this is indeed the case.
For CASH OUT transactions, we obtain less promising results than for TRANSFER on both the train and test sets. Logistic regression and linear SVM show similar performance. SVM with an RBF kernel gives higher recall but lower precision on average for this set of transactions. A possible reason for this outcome is the non-linear decision boundary computed using the RBF kernel function. However, for all five algorithms, we can obtain high recall scores if we are more tolerant of false positives. In the real world, this is purely a design/business decision and depends on how many false positives a payments company is willing to tolerate.
TABLE VI: Results on Test set

TRANSFER
Algorithm             Recall   Precision   F1-measure
Logistic Regression   0.9951   0.4444      0.6144
Linear SVM            0.9951   0.4516      0.6213
SVM with RBF kernel   0.9886   0.5823      0.8949
KNN                   0.9328   0.9732      0.9526
Random Forest         0.9975   1.0000      0.9987

CASH OUT
Algorithm             Recall   Precision   F1-measure
Logistic Regression   0.9886   0.1521      0.2636
Linear SVM            0.9411   0.1246      0.6893
SVM with RBF kernel   0.9789   0.1321      0.7271
KNN                   0.5674   0.8895      0.6928
Random Forest         0.9927   0.9987      0.9957
Overall, we observe that all our proposed approaches detect fraud transactions with high accuracy and low false positives, especially for TRANSFER transactions. With more tolerance for false positives, they can perform well on CASH OUT transactions as well.
APPENDIX
GitHub code link - https://github.com/dhhanushvanama/Fraud-Detection
VIII. ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to Professor Thatiparthy Bharath Kumar and the entire teaching staff for delivering a remarkably well-organized and expertly taught class. Their dedication and clear instruction have made this learning experience truly enriching. Thank you for your hard work and commitment.
REFERENCES
[1] A Survey of Credit Card Fraud Detection Techniques: Data and Technique Oriented Perspective - Samaneh Sorournejad, Zojah, Atani et al. - November 2016
[2] Support Vector Machines and Malware Detection - T. Singh, F. Di Troia, C. Vissagio, Mark Stamp - San Jose State University - October 2015
[3] Solving the False Positives Problem in Fraud Prediction Using Automated Feature Engineering - Wedge, Canter, Rubio et al. - October 2017
[4] PayPal Inc. quarterly results: https://www.paypal.com/stories/us/paypalreports-third-quarter-2018-results
[5] A Model for Rule Based Fraud Detection in Telecommunications - Rajani, Padmavathamma - IJERT - 2012
[6] HTTP Attack Detection Using n-gram Analysis - A. Oza, R. Low, M. Stamp - Computers and Security Journal - September 2014
[7] Scikit-learn - machine learning library: http://scikit-learn.org
[8] PaySim - Synthetic Financial Datasets For Fraud Detection: https://www.kaggle.com/ntnu-testimon/paysim1
[9] Awoyemi, John O., Adebayo O. Adetunmbi, and Samuel A. Oluwadare. "Credit card fraud detection using machine learning techniques: A comparative analysis." 2017 International Conference on Computing Networking and Informatics (ICCNI). IEEE, 2017.
[10] Raghavan, Pradheepan, and Neamat El Gayar. "Fraud detection using machine learning and deep learning." 2019 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE). IEEE, 2019.