Payments Fraud Detection using ML methods: Exploring Performance, Ethical and Real-World Considerations in Machine Learning based Fraud Detection for Secure Payments
1 Introduction
The number and severity of cyber-attacks are increasing rapidly these days, with mali-
cious parties often targeting transactions made by banks, financial institutions and pay-
ment service providers (PSPs). The European Central Bank found that in 2019 alone,
the value of fraudulent transactions amounted to over €1.8 billion [1]. Credit card fraud
has thus become a major issue for banks and financial institutions leading to theft of
funds, stolen personal data and massive financial damages to customers and business
owners. Traditional fraud detection methods are often limited by their lack of adaptability to new and evolving patterns of fraud. The aim of this project is to develop
a fraud detection system using machine learning (ML) techniques to classify fraudulent
transactions. This can serve to improve monitoring, response and support many of the
current threat mitigation practices in place at financial institutions. However, there are
challenges associated with the use of ML for fraud detection, namely imbalanced datasets, where legitimate transactions far outnumber fraudulent payments, which can lead to model bias [8, 14]. While there is a growing body of research on
the issue of class imbalance, there is a lack of studies that address the problem using
anomaly detection algorithms [2, 6]. This work was carried out in collaboration with Deloitte Cyber and involves leveraging their expertise to understand the security land-
scape for payments, including the types of threats faced, tactics used by attackers, indi-
cators for fraud and the selection of appropriate threat mitigation strategies. Through
this paper we are looking to answer the following research question: To what extent
can machine learning (ML) techniques be used to develop methods supporting fraud
detection and security of payments?
2 Related Work
Previous works have shown that ML-based fraud detection can help improve accuracy
and efficiency of threat mitigation by analyzing large amounts of data, identifying fea-
tures and patterns in the data and adapting to them. For example, Mathew et al. inves-
tigate common fraud interface processes and propose a methodology for classifying
fraudulent credit card transactions, applying Logistic Regression, K-NN, Random For-
est and Decision tree methods for fraud detection on a credit card dataset [10]. A similar
study by Asha and Kumar [4] also evaluated SVMs and K-NN alongside artificial neu-
ral networks, achieving high accuracy, precision and recall. Khine and
Khin propose a novel boosting approach using ensemble methods and apply it for fraud
detection with sample benchmark datasets [7]. However, as introduced above, there are challenges associated with the use of ML for fraud detection. For instance, both Kulatilleke [8] and Tomar et al. [14] address the challenge of imbalanced fraud data, where the number of fraudulent transactions is far smaller than the number of legitimate payments, which can lead to bias. Attempts to solve this problem have mainly looked to resampling
techniques [3, 11], cost-sensitive learning [13] and ensemble methods such as bagging
[14]. However, as mentioned in Section 1, there is a lack of research addressing the
problem of class imbalance using anomaly detection algorithms. These techniques have
various advantages as they can be used to identify instances of the minority class and
give them more weight in the model, effectively balancing the class distribution. This
provides a more balanced dataset for training a machine learning model and helps re-
duce model bias. They can also be used to identify unusual patterns in data and flag
transactions that deviate from normal behavior, which can be a strong indicator for
fraud. Finally, there is a need for more research on the interpretability of ML models, as
it is important to understand how a model makes predictions in order to ensure legiti-
mate transactions are not incorrectly flagged as fraudulent. Explainable AI (XAI) tech-
niques can help provide insights into the decision-making process of ML models and
make it more transparent, so that the decisions made by a model can be justified and
trusted by humans.
3 Methodology
The method followed for this study relies on a machine learning process for classifying
credit card transactions, followed by research on current approaches to fraud detection
and how they can be supported by these ML techniques.
Data preprocessing. The preprocessing steps included imputing null values for numerical columns with the bounded mean, accounting for outliers, and imputing null values in categorical columns using the mode. The categorical columns were then encoded with one-hot encoding to ensure an overall quantitative representation.
Correlation analysis was performed to assess the correlation between the feature col-
umns of the Vesta dataset as well as with the target variable. This allowed us to identify
the most correlated feature pairs. The results of this analysis were used to perform fea-
ture engineering on the dataset, i.e. dropping features based on correlation and gener-
ating new polynomial variables that can help improve performance. At the end of this
phase, the dataset to be used contained 144,233 transactions and 609 feature columns
(including transactionID, transactionDT timestamp, payment amount
(TransactionAmt) and 'Class' label). This dataset is used for subsequent experimentation and is referred to as the 'base' dataset. In this process, the PCA
(Principal Components Analysis) algorithm was also applied to transform data to a
lower-dimensional dataset by identifying the most important underlying features. This
is done by finding the directions, called principal components, that capture the
maximum amount of variation in the data [14].
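The following is a minimal sketch of this preprocessing pipeline using pandas and scikit-learn. Column names such as TransactionAmt and Class follow the Vesta dataset, while the clipping quantiles used to form the bounded mean and the retained PCA variance are illustrative assumptions rather than the study's exact settings.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Illustrative sketch of the preprocessing steps described above (assumed column names).
df = pd.read_csv("vesta_transactions.csv")

num_cols = df.select_dtypes(include="number").columns.drop("Class", errors="ignore")
cat_cols = df.select_dtypes(exclude="number").columns

# Impute numerical nulls with a bounded ("clipped") mean to limit outlier influence,
# and categorical nulls with the mode. The 1%/99% clipping bounds are an assumption.
for col in num_cols:
    bounded = df[col].clip(df[col].quantile(0.01), df[col].quantile(0.99))
    df[col] = df[col].fillna(bounded.mean())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# One-hot encode categorical columns for a fully quantitative representation.
df = pd.get_dummies(df, columns=list(cat_cols))

# Project the features onto their principal components; the retained variance is an assumption.
features = df.drop(columns=["Class"])
pca = PCA(n_components=0.95)          # keep components explaining 95% of the variance
X = pca.fit_transform(features)
y = df["Class"].to_numpy()
```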
Outlier detection techniques. In order to account for outliers in the dataset, different
methods were applied to the preprocessed dataset and their results compared to one
another. To start off, the Z-score method (also known as standard score method) was
used to find outliers statistically. As per this approach, the Z-score value is calculated
for each data point by subtracting the mean from the data point and dividing the result
by the standard deviation. This method was used with threshold=4, which corresponds
to 4 standard deviations away from the mean. Data points with Z-scores exceeding this
are classified as outliers [22]. The second technique used to identify outliers in the da-
taset is the Isolation Forest algorithm [17]. This is an unsupervised learning algorithm
that isolates data points using a binary tree structure, calculates the path length required
to isolate each data point in the tree and assigns an anomaly score for each data point
based on average path length [17]. This method was used with basic parameters such
as estimators=100, contamination=0.05, random_state=42. Finally, the third tech-
nique to identify and filter outliers is the Local Outlier Factor (LOF) algorithm [8].
Unlike the previous two, the LOF method chooses outliers based on the density distri-
bution of the data points, measuring local deviation of a point with respect to neighbor-
ing points. A high LOF value would indicate that the data point has lower density com-
pared to its neighbors, suggesting that it is an outlier. This approach was used with
specified parameters neighbors=20 and contamination=0.1.
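A minimal sketch of the three outlier detection approaches with the parameters reported above is given below, assuming scikit-learn implementations; applying the Z-score filter per feature column is an interpretive assumption, and the scikit-learn parameter names (e.g. n_estimators) are used where the text abbreviates them.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# X is the preprocessed feature matrix (numpy array) produced in the previous step.

# 1) Z-score method: flag rows containing any value more than 4 standard deviations
#    from that column's mean (per-column application is an assumption).
z_scores = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9))
zscore_outliers = (z_scores > 4).any(axis=1)

# 2) Isolation Forest with the parameters reported above.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iso_outliers = iso.fit_predict(X) == -1      # -1 marks outliers

# 3) Local Outlier Factor, based on local density deviation from neighboring points.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
lof_outliers = lof.fit_predict(X) == -1

# Example: keep only inliers for subsequent training (here using the Isolation Forest mask).
X_filtered, y_filtered = X[~iso_outliers], y[~iso_outliers]
```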
Data resampling. The steps taken in this phase are to reduce the impact of class im-
balance as well as ensure we use the most important features of the dataset for model
training. SMOTE (Synthetic Minority Over-sampling Technique) [9] was applied to
the preprocessed dataset, to oversample the underrepresented class. The algorithm
works by generating synthetic examples of the minority class (i.e. fraudulent payments)
to balance the label distribution. This helps in reducing bias [9] towards the majority
class (i.e. legitimate payments) and improving the model's ability to reliably classify
minority class instances. This resampling method was used with sampling_strategy=0.5, which results in SMOTE producing synthetic samples such that, in the end, there is one minority class instance for every two instances of the majority class. This leads to a larger number of fraudulent payments in the dataset, which ML models can use to better learn and classify.
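A sketch of this resampling step with imbalanced-learn's SMOTE at the reported sampling strategy follows; the input variable names carry over from the sketches above and the random seed is an assumption.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the fraudulent (minority) class until it reaches half the size
# of the legitimate (majority) class, i.e. a 1:2 ratio.
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_filtered, y_filtered)

print("Class distribution after SMOTE:", Counter(y_resampled))
```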
Model training and evaluation. In this phase, a set of machine learning (ML) algo-
rithms are trained on the dataset and used to classify transactions - either as 'legitimate'
(0) or 'fraudulent' (1). The models that are used include Logistic regression, Linear
SVC, Multi-layer perceptron (MLP), Decision Tree, Random Forest, AdaBoost and
Gradient Boosting classifiers. Other gradient boosting algorithms namely CatBoost,
LightGBM and XGBoost, are also used for classification. The selection of these models
in our experiment is due to their diverse characteristics and proven success in various
machine learning tasks.
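A minimal sketch of training and evaluating this pool of classifiers on a held-out split is shown below; the 80-20 split mirrors the experimental setup described later, and the hyperparameters shown are library defaults rather than the tuned values used in the study.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "MLP": MLPClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "GradientBoosting": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(),
    "LightGBM": LGBMClassifier(),
    "CatBoost": CatBoostClassifier(verbose=0),
}

# Fit each classifier and report a headline metric on the held-out data.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: F1 = {f1_score(y_test, preds):.3f}")
```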
Model refinement. The selected ML classifiers are fine-tuned to optimize for perfor-
mance on the chosen metrics. This involves adjusting hyperparameter settings of the
algorithms and selecting them such that the model performs optimally. GridSearch [7]
is the preferred tool for such experimentation as it can systematically search through a
predefined hyperparameter space and select optimal combinations yielding the best per-
formance. For each of the previously mentioned classifiers, a grid of hyperparameters
is defined, encompassing different values or ranges.
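As an illustration, a grid search over a small Random Forest hyperparameter grid might look as follows; the grid values and scoring choice are illustrative assumptions, not the ranges actually used in the study.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hypothetical grid; the actual per-classifier ranges are defined analogously.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",          # optimize for F1, suited to imbalanced data
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```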
Model interpretation and explainability. To better understand the predictions made
by these ML classifiers, explainable AI (XAI) methods have been used to better inter-
pret the underlying decision-making process. Through these techniques, we will also
investigate which features in the dataset contribute the most for predictions made by
the algorithms. Firstly, LIME [24] generates local explanations by approximating
black-box model predictions, allowing us to identify the most relevant features. Then,
SHAP [18] is a game-theoretic approach to explain individual predictions, by assigning
a "credit" score to each feature based on the impact on the model's final prediction. The
approach to validate these trained ML models revolves around understanding real-
world practices and how they can be deployed to support these processes. This can be
broken down into the following steps:
Evaluating models for fairness, accountability & bias. To evaluate these ML classi-
fiers for fairness and bias, the implications they have on algorithmic parity measures
(such as demographic parity, predictive parity, equalized odds etc.) can be studied. It is
important to consider the impact of a model's predictions on protected groups of people.
Since the Vesta dataset is largely black-box and anonymized to conceal personal details
such as gender, age, region etc., salient groups can instead be created based on the
transaction amount (TransactionAmt). If the results indicate the classifiers are biased,
debiasing methods [12] can be considered such as data augmentation techniques, mod-
ifying the algorithms used or post-processing the output prediction of an algorithm.
Critical Assessment and Deployment Feasibility. This phase involves a study of cur-
rent fraud detection practices in the payments industry and mitigation strategies that are
used by financial institutions to prevent crimes such as money laundering, identity theft,
card skimming etc.
The data strategy followed in this research includes the following stages: data collection
and exploration, preprocessing for learning experiments, model training and evaluation
of results.
Exploratory Data Analysis. The Vesta dataset was explored to discover insights on
the transactions recorded and their features. Over 85,000 transactions in the original
dataset were found to be made on desktop compared to mobile devices, which saw
around 55,000 payments. This reflects general consumer habits as well since growth in
mobile conversion rate is sluggish and lagging behind desktop: people are happier to browse on mobile but ultimately buy on desktop [27]. As likely another indicator for
this preference, larger screen sizes such as ‘1920 x 1080’ and ‘1366 x 768’ were found
more often in making transactions. It was also found that most transactions are made
on Windows 10 and 7 operating systems which again are desktop-oriented, compared
to mobile OS versions. Notably, among mobile OS, iOS versions recorded a higher
share of transactions compared to Android. Most transactions were made using Visa
(385k) and Mastercard (189k) cards, with American Express (8.3k) and Discover (6.7k)
coming in at distant 3rd and 4th spots respectively. Most transactions were made using a debit card (441k), with significantly fewer through credit (149k) and charge (15) cards.
Interestingly, the number of fraudulent transactions is roughly the same for both debit
and credit (10k). This would suggest that on average, credit cards are nearly 3x more
susceptible to fraud. The main reason for this could be the size of transactions made
with credit cards. The average transaction amount was found to be 64% higher for credit
card transactions compared to debit payments in the Vesta dataset. This may be because
credit cards are typically more used for work, business or commercial purposes. Addi-
tionally, it was found that much of the original dataset was sparse as many of the col-
umns had missing values. The most sparse features are M3, V4, M1, M2, dist1, M5,
M6, M7, M8, M9, V2, V3, V1, V5, V9, D11, V6, V10, V11, V8 and V7, with close to 100% null values. In order to find the most relevant features in the dataset, the corre-
lation coefficient scores between the feature and the ‘Class’ column were used. The top
10 important features found were V87, V45, V86, V257, V246, V244, V242, V44,
V201 and V200 respectively, with the first 8 all having a correlation above 35% with the target variable 'Class'. A correlation matrix was also derived from these scores to find pairs of features that are correlated to one another. The top 10 highly inter-corre-
lated feature pairs found are D12-D4, V322-V95, V323-V96, V324-V97, C6-C4,
V293-V279, V101-V95, V322-V101, V322-V279 and V324-V280. After preparation
using correlation analysis results, the missing values were cleared from the dataset and
it was preprocessed to contain 144,233 payments (out of the original 590,540) and 609
columns in total, including the Class label. Out of these, only 11,318 (7.85%) transac-
tions are labelled fraudulent, which signifies that the data is highly imbalanced.
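A sketch of how such correlation scores can be computed with pandas is given below; the feature and label column names follow the Vesta dataset, and the variable df refers to the preprocessed dataframe from the earlier sketch.

```python
import numpy as np
import pandas as pd

# df is the preprocessed Vesta dataframe with a binary 'Class' label.
corr = df.corr(numeric_only=True)

# Features most correlated with the target variable.
target_corr = corr["Class"].drop("Class").abs().sort_values(ascending=False)
print("Top 10 features by correlation with Class:\n", target_corr.head(10))

# Highly inter-correlated feature pairs: keep only the upper triangle so each pair appears once.
feat_corr = corr.drop(columns="Class").drop(index="Class").abs()
upper = feat_corr.where(np.triu(np.ones(feat_corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print("Top 10 inter-correlated feature pairs:\n", pairs.head(10))
```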
Experimental Setup. In order to collect results, four different experiments were de-
signed to train and evaluate the chosen supervised learning algorithms. These are based
on different combinations of the outlier detection techniques and data resampling ap-
proach discussed in Section 3. In experiment 1, the preprocessed dataset is transformed
using PCA and all ten ML models are subsequently applied with an 80-20 split. In experi-
ment 2, three different outlier detection techniques are first used (separately) to remove
outliers in the preprocessed dataset before PCA and then the supervised learning algo-
rithms are applied. Experiment 3 sees the preprocessed data resampled with SMOTE
prior to PCA and the set of ML algorithms. Finally, experiment 4 involves a combina-
tion of the previous experiments by first removing outliers, then resampling with
SMOTE prior to PCA and ML models. In all these experiments, different performance
metrics are measured namely precision, recall, F1, area under ROC curve (AUROC)
and R2 score. While accuracy is widely used in classification problems, it is not a reli-
able metric when the data in question suffers from class imbalance.
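A sketch of how these metrics can be computed per model on the held-out test split follows; the helper function name is illustrative.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, r2_score)

def evaluate(model, X_test, y_test):
    """Compute the metrics measured across all four experiments for one fitted model."""
    preds = model.predict(X_test)
    # AUROC needs a continuous score; fall back to hard predictions if probabilities are unavailable.
    scores = (model.predict_proba(X_test)[:, 1]
              if hasattr(model, "predict_proba") else preds)
    return {
        "precision": precision_score(y_test, preds),
        "recall": recall_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        "auroc": roc_auc_score(y_test, scores),
        "r2": r2_score(y_test, preds),
    }
```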
5 Results
This section presents the findings from the experimental setup and running the set of
ML classifiers on an unseen subset of the data in order to evaluate performance. The
following subsections examine results found for each of the four experiments con-
ducted to evaluate how well ML models can be used for fraud detection purposes.
SHAP values were used in this experiment to evaluate, for each input feature, the impact it has on the final prediction. Figure 3 displays the summary plot for the SHAP analysis of
the XGBoost classifier tested. As per this analysis, the top 5 most important features
found are V0, V11, V3, V1 and V51 with SHAP values of 0.69, 0.21, 0.17, 0.15 and
0.148 respectively.
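Such a SHAP analysis for the XGBoost model can be reproduced along these lines; this is a minimal sketch assuming the shap package and a fitted classifier (xgb_model), while the feature names and values shown in Figure 3 come from the actual study.

```python
import shap

# Explain the fitted XGBoost classifier on the test set using TreeExplainer.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Summary plot of per-feature impact on the model output (as in Figure 3).
shap.summary_plot(shap_values, X_test)
```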
Demographic parity [10] is achieved when an equal proportion of positive predictions is made by a model for each group category. Figure 4 displays an overview
of both metrics for the ML classifiers selected. From the results, it can be seen that the
Random Forest and LightGBM models achieved the highest demographic parity on the
base dataset, with values of 0.545 and 0.548 respectively, meaning that only a little over half the time is the same fraction of transactions predicted 'fraudulent' regardless
of transaction size. This implies the transaction size has a rather large impact on the
proportion of payments that are predicted fraudulent, and the trained models are not immune to the impact of payment amount on classification performance.
Fig. 3. SHAP summary for XGBoost model. Fig. 4. Demographic Parity and Equalized Odds
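A minimal sketch of how demographic parity and equalized odds ratios could be computed for groups formed by transaction amount is shown below; the binary split at the median amount and the ratio formulation are illustrative assumptions, as the exact grouping used in the study is not detailed here.

```python
import numpy as np

def parity_metrics(y_true, y_pred, transaction_amt):
    """Compare positive-prediction and true-positive rates across two amount-based groups."""
    # y_true, y_pred and transaction_amt are 1-D numpy arrays of equal length.
    group = transaction_amt > np.median(transaction_amt)   # high vs. low amounts (assumption)

    # Demographic parity: ratio of positive prediction rates between the two groups.
    rate_high = y_pred[group].mean()
    rate_low = y_pred[~group].mean()
    demographic_parity = min(rate_high, rate_low) / max(rate_high, rate_low)

    # Equalized odds (true positive rate component): ratio of TPRs between the two groups.
    tpr_high = y_pred[group & (y_true == 1)].mean()
    tpr_low = y_pred[~group & (y_true == 1)].mean()
    equalized_odds = min(tpr_high, tpr_low) / max(tpr_high, tpr_low)

    return demographic_parity, equalized_odds
```

A ratio close to 1 indicates that both transaction-amount groups are treated similarly by the classifier, while values such as 0.545 indicate a substantial disparity.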
Table 2 (b) displays performance of the selected ML classifiers after applying the Iso-
lation Forest algorithm (as outlined in Section 3) on the dataset. As can be seen above,
the performance of ML classifiers is marginally higher on the dataset with outliers re-
moved using the Isolation Forest algorithm. This may be mainly because the
Isolation Forest algorithm (with contamination=0.05) only picked 5% as outliers, while
the Z-score method (with threshold of 4) identified 59.5% as outliers. As a result, there
is less data for the ML models to be trained on in the latter case, which can explain the worse performance on the test set. Moreover, the Z-score method assumes the data fol-
lows an underlying Gaussian distribution which is not likely the case, since only 40%
of the data here lies within 4 standard deviations of the mean. Table 3 displays perfor-
mance of the selected ML classifiers after applying the Local Outlier Factor algorithm
(as outlined in Section 3) on the preprocessed dataset.
Table 3. Exp. 2 - LOF method results. Fig. 5. Exp. 2 - Plot of metrics for all approaches.
As evidenced from the results, application of LOF algorithm on the data to find outliers
led to superior performance by the selected ML classifiers. This approach with basic
parameters identified roughly 10% of the transactions as outliers. Since it takes into
account the density of neighboring points to determine if a point is an outlier or not,
LOF is more effective for datasets with complex structures. Figure 5 displays an over-
view of all performance metrics for one of the better classifiers, Random Forest, across
the three outlier detection techniques discussed. Interestingly, the recall, F1 and R2
scores can be seen to be highest with the LOF approach (in comparison to the other two
techniques). This could be due to LOF's ability to capture local anomalies which can
result in better recall, as it can detect a higher proportion of true outliers in the dataset.
With improved recall, the ML models are able to identify more positive instances cor-
rectly, which is particularly important in scenarios where the focus is on detecting rare
events or anomalies. LOF's ability to accurately identify local outliers also helps in
reducing false positives (high precision) while still capturing true positives (high re-
call), leading to an improved overall F1 score. Removing or down-weighting identified
outliers allows the model to focus on the meaningful patterns and relationships within
the data, leading to a better fit and improved R2 score.
As we can observe, there is not much change in the results upon applying outlier detec-
tion and removal techniques on the resampled dataset. There are a few reasons why this
might be happening; firstly, SMOTE generates synthetic samples to increase the repre-
sentation of the minority class. These synthetic samples are often located in regions
between existing minority class instances. Since outlier detection techniques typically
focus on identifying observations that deviate significantly from the majority of the
data, the synthetic samples generated by SMOTE might not be considered outliers. As
a result, the distribution and characteristics of the outliers in the dataset may remain
relatively unchanged, leading to limited impact on the outlier detection results. The
primary objective of applying SMOTE is to address the issue of class imbalance and
improve representation of the minority class. Outlier detection techniques primarily aim
to identify and handle extreme values or observations that deviate significantly from
the norm. While both techniques deal with anomalies in the dataset, their focus and
objectives are quite distinct. As a result, the impact of applying outlier detection tech-
niques on classification performance is overshadowed by the substantial improvements
achieved through SMOTE in addressing class imbalance.
Fig. 6. (a) Exp. 3 - Plot of metrics post-resampling. (b) Exp. 4 - Random Forest classifier.
As shown in Figures 6 (a) and (b), this can be seen in the numbers of both Random
Forest and XGBoost classifiers trained on the resampled dataset, where values for pre-
cision, recall and F1 remain unchanged for different outlier detection techniques while
area under ROC curve and R2 scores shift slightly. Interestingly, for both of these clas-
sifiers, the Z-score method for removing outliers seems to now marginally outperform
the Isolation Forest and LOF algorithms. This could be because resampling algorithms
such as SMOTE can potentially alter the distribution of the data. The Z-score method
calculates the Z-score for each data point based on the mean and standard deviation of
the original data distribution. If the resampling process significantly affects the mean
and standard deviation, the Z-score values of the data points may change such that their deviation from the mean becomes more apparent. This can
improve the Z-score method's ability to identify and remove outliers effectively.
6 Conclusion
This paper investigated the use of machine learning (ML) techniques in fraud detection
and payment security. It explored various aspects, including current fraud prevention
approaches, the effectiveness of ML algorithms, the impact of outlier detection and data
resampling techniques, and the applicability of explainable AI (XAI) and Responsible
AI principles. The results indicate that MLPs, Random Forest, CatBoost, and XGBoost
perform well in predicting fraudulent transactions, while Decision Trees perform
poorly. SMOTE effectively addresses class imbalance, and XAI techniques enhance
transparency and understanding of ML models. Responsible AI principles are crucial
for evaluating fairness and bias. The study emphasizes the need for a comprehensive
approach that combines technology, data analysis, and human expertise, highlighting
real-time monitoring, anomaly detection, verification processes, collaboration with law
enforcement, and sharing threat intelligence as important strategies. Leveraging tech-
nology and AI can strengthen fraud detection capabilities, minimize losses, and ensure
secure payment transactions.
References
1. Fraud Detection: An Ultimate Guide for Protecting & Preventing Fraud. https://www.inscribe.ai/fraud-detection.
2. Deep Dive: How AI and ML Improve Fraud Detection Rates And Reduce False Positives. https://www.fintechnews.org/how-ai-and-machine-learning-can-turn-the-tide-of-fraud
3. Seventh Report on Card Fraud. (29 Oct. 2021). https://www.ecb.europa.eu/pub/cardfraud/html/ecb.cardfraudreport202110~cac4c418e8.en.html.
4. Charu C Aggarwal. 2015. Outlier analysis. Springer.
5. Noor Alfaiz and Suliman Fati. 2022. Enhanced Credit Card Fraud Detection Model Using Machine Learning.
Electronics 11 (02 2022), 662.
6. R. B. Asha and K.R. Suresh Kumar. 2021. Credit card fraud detection using artificial neural network. Global
Transitions Proceedings 2, 1 (2021), 35–41. 1st International Conference on Advances in Information, Com-
puting and Trends in Data Engineering (AICDE - 2020).
7. James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 13 (Feb. 2012), 281–305.
8. Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying Density-
Based Local Outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management
of Data (Dallas, Texas, USA) (SIGMOD '00). Association for Computing Machinery, New York, NY, USA,
93–104.
9. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-
sampling Technique. Journal of Artificial Intelligence Research 16 (jun 2002), 321–357.
10. Eleni Digalaki. 2022. The impact of artificial intelligence in the banking sector how AI is being used in 2022.
11. Kat Edwards. 2023. Banking Fraud Investigations – How Do Banks Detect Fraud?
12. Michael Feldman, Sorelle A Friedler, Jeremy Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian.
2015. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining. ACM, 259–268.
13. Zhi-Min Huang, Huan Liu, and Artyom Sedrakyan. 2010. Outlier detection in large databases: a wavelet-based approach. In Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, 3107–3111.
14. Ian T. Jolliffe and Jorge Cadima. 2016. Principal component analysis: A review and recent developments.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374,
2065 (2016), 20150202.
15. Aye Aye Khine and Hint Wint Khin. 2020. Credit Card Fraud Detection Using Online Boosting with Extremely
Fast Decision Tree. In 2020 IEEE Conference on Computer Applications (ICCA). 1–4.
16. Gayan K. Kulatilleke. 2022. Challenges and Complexities in Machine Learning based Credit Card Fraud De-
tection.
17. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation Forest. In 2008 Eighth IEEE International
Conference on Data Mining. 413–422.
18. Scott M Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances
in Neural Information Processing Systems. 4765–4774.
19. Jincy C Mathew, B Nithya, C R Vishwanatha, Prathiksha Shetty, H Priya, and G Kavya. 2022. An Analysis
on Fraud Detection in Credit Card Transactions using Machine Learning Techniques. In 2022 Second Interna-
tional Conference on Artificial Intelligence and Smart Energy (ICAIS). 265–272.
20. Niccolo Mejia. 2020. AI-Based Fraud Detection in Banking – Current Applications and Trends.
21. Jelle Oorebeek. 2023. Efficient Bank Fraud Investigations | A Complete Guide.
22. Keith Ord. 1996. Outliers in statistical data: V. Barnett and T. Lewis, 1994, 3rd edition (John Wiley & Sons, Chichester), £55.00, ISBN 0-471-93094-6. International Journal of Forecasting 12, 1 (1996), 175–176.
23. Andrea Dal Pozzolo and Gianluca Bontempi. 2015. Adaptive Machine Learning for Credit Card Fraud Detec-
tion.
24. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?”: Explaining the
Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, 1135–1144.
25. Yusuf Sahin, Serol Bulkan, and Ekrem Duman. 2013. A cost-sensitive decision tree approach for fraud detec-
tion. Expert Systems with Applications 40 (11 2013), 5916–5923.
26. Pooja Tomar, Sonika Shrivastava, and Urjita Thakar. 2021. Ensemble Learning based Credit Card Fraud De-
tection System. In 2021 5th Conference on Information and Communication Technology (CICT). 1–5.
27. Casey Turnbull. 2023. Why are Mobile Conversion Rates behind Desktop?