Fraud Detection Handbook
All content following this page was uploaded by Yann-Aël Le Borgne on 03 May 2021.
Detecting fraud patterns in payment card transactions is known to be a very difficult problem. With the ever-growing
amount of data generated by payment card transactions, it has become impossible for a human analyst to detect fraudulent
patterns in transaction datasets, which are often characterized by a large number of samples, many dimensions, and online
updates. As a result, the design of payment card fraud detection techniques has increasingly focused over the last decade
on approaches based on machine learning (ML) techniques that automate the process of identifying fraudulent patterns in
large volumes of data [PP19,CLBC+19,SSB18,DP15].
The integration of ML techniques in payment card fraud detection systems has greatly improved their ability to detect
fraud more efficiently and to assist payment processing intermediaries in identifying illicit transactions. Although the
number of fraudulent transactions has kept increasing in recent years, the percentage of losses due to fraud started to
decrease in 2016, a reversal that is associated with the increasing adoption of ML solutions [rep19]. Besides helping to
save money, implementing ML-based fraud detection systems is becoming a must for institutions and companies to gain
the trust of their customers.
A widely recognized and recurrent issue in this new field of ML for card fraud detection is the lack of reproducibility of most
of the research work published on the topic [LJ20,PP19,PL18,ZAM+16]. On the one hand, payment card transaction data are
scarce, since they cannot be publicly shared for confidentiality reasons. On the other hand, authors do
not make enough effort to provide their code and make their results reproducible.
This book aims to take a first step toward reproducibility in the benchmarking of payment card fraud detection
techniques. Due to the vast amount of published research in the domain, it was not possible to exhaustively review and
implement all existing techniques. Rather, we chose to focus on some of the techniques that appeared to us to be the most
essential, based on our 10-year collaboration with our industrial partner Worldline.
Some of the techniques presented, such as those dealing with class imbalance, ensembles of models, or concept drift, are
widely acknowledged as being essential parts of the design of a credit card fraud detection system. We additionally cover
less well-documented topics that we think deserve more attention. These include in particular design aspects of the
modeling process, such as the choice of performance metrics and validation strategies, and promising preprocessing and
learning strategies such as feature embeddings, and active and transfer learning.
While the book focuses on payment card fraud, we believe that most of the techniques and discussions presented in this
book can be useful to other practitioners working on the wider topic of fraud detection.
With the reproducibility of experiments as a key driver for this book, the choice of a Jupyter Book format appeared better
suited than a traditional printed book format. First, all the sections of this book that include code are Jupyter notebooks,
which can be executed independently either on the reader’s computer by cloning the book repository, or online using Google
Colab or Binder. Second, the open-source nature of the book (fully available on a public GitHub repository) allows readers
to open discussions on the book content through GitHub issues, or to propose amendments or improvements with pull
requests. More importantly, this Jupyter Book format, together with the open-source license, allows any practitioner or
researcher to clone the repository and add content.
License
The code in the notebooks is released under a GNU GPL v3.0 license. The prose and pictures are released under a CC BY-SA
4.0 license.
If you wish to cite this book, you may use the following:
@book{leborgne2021fraud,
title={Machine Learning for Credit Card Fraud Detection - Practical Handbook},
author={Le Borgne, Yann-A{\"e}l and Bontempi, Gianluca},
url={https://github.com/Fraud-Detection-Handbook/fraud-detection-handbook},
year={2021},
publisher={Universit{\'e} Libre de Bruxelles}
}
Authors
Yann-Aël Le Borgne - Main author, contact author ([email protected]).
Gianluca Bontempi - Research design, supervision and monitoring, manuscript revision.
Acknowledgments
This book is the result of ten years of collaboration between the Machine Learning Group, University of Brussels, Belgium
and Worldline.
We wish to thank all the colleagues who worked on this topic during this collaboration: Olivier Caelen (ULB-
MLG/Worldline), Fabrizio Carcillo (ULB-MLG), Guillaume Coter (Worldline), Andrea Dal Pozzolo (ULB-MLG), Jacopo De
Stefani (ULB-MLG), Rémy Fabry (Worldline), Liyun He-Guelton (Worldline), Bertrand Lebichot (ULB-MLG), Gian Marco
Paldino (ULB-MLG), Wissam Siblini (Worldline), Théo Verhelst (ULB-MLG).
The collaboration was made possible thanks to Innoviris, the Brussels Region Institute for Research and Innovation, through
a series of grants which started in 2012 and ended in 2021.
2018 to 2021. DefeatFraud: Assessment and validation of deep feature engineering and learning solutions for fraud detection.
Innoviris Team Up Programme.
2015 to 2018. BruFence: Scalable machine learning for automating defense system. Innoviris Bridge Programme.
2012 to 2015. Adaptive real-time machine learning for credit card fraud detection. Innoviris Doctiris Programme.
Bibliography
[AA17]
Aderemi O Adewumi and Andronicus A Akinyelu. A survey of machine-learning and nature-inspired based credit card
fraud detection techniques. International Journal of System Assurance Engineering and Management, 8(2):937–953,
2017.
[Ban20]
European Central Bank. 6th report on card fraud. August 2020. [Online; Last consulted 09-October-2020]. URL:
https://www.ecb.europa.eu/pub/cardfraud/html/ecb.cardfraudreport202008~521edb602b.en.html#toc2.
[BTK+21]
Marijan Beg, Juliette Taka, Thomas Kluyver, Alexander Konovalov, Min Ragan-Kelley, Nicolas M Thiéry, and Hans
Fangohr. Using jupyter for reproducible scientific workflows. Computing in Science & Engineering, 23(2):36–46, 2021.
[BB12]
James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of machine learning
research, 2012.
[Bis06]
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
[Bon21]
Gianluca Bontempi. Statistical foundations of machine learning, 2nd edition. Université Libre de Bruxelles, 2021.
[BEP13]
Kendrick Boyd, Kevin H Eng, and C David Page. Area under the precision-recall curve: point estimates and confidence
intervals. In Joint European conference on machine learning and knowledge discovery in databases, 451–466. Springer,
2013.
[Car18]
Fabrizio Carcillo. Beyond Supervised Learning in Credit Card Fraud Detection: A Dive into Semi-supervised and Distributed
Learning. Université libre de Bruxelles, 2018.
[CDPLB+18]
Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, and Gianluca Bontempi.
Scarff: a scalable framework for streaming credit card fraud detection with spark. Information fusion, 41:182–194,
2018.
[CLBCB18]
Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, and Gianluca Bontempi. Streaming active learning strategies
for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and
Analytics, 5(4):285–300, 2018.
[CLBC+19]
Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Yacine Kessaci, Frédéric Oblé, and Gianluca Bontempi.
Combining unsupervised and supervised learning in credit card fraud detection. Information Sciences, 2019.
[CTMozetivc20]
Vitor Cerqueira, Luis Torgo, and Igor Mozetič. Evaluating time series forecasting models: an empirical study on
performance estimation methods. Machine Learning, 109(11):1997–2028, 2020.
[Cha09]
Nitesh V Chawla. Data mining for imbalanced datasets: an overview. In Data mining and knowledge discovery handbook,
pages 875–886. Springer, 2009.
[CCHJ08]
Nitesh V Chawla, David A Cieslak, Lawrence O Hall, and Ajay Joshi. Automatically countering imbalance and its
empirical relationship to cost. Data Mining and Knowledge Discovery, 17(2):225–252, 2008.
[CJK04]
Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Special issue on learning from imbalanced data sets.
ACM SIGKDD explorations newsletter, 6(1):1–6, 2004.
[CLB+04]
Chao Chen, Andy Liaw, Leo Breiman, and others. Using random forest to learn imbalanced data. University of
California, Berkeley, 110(1-12):24, 2004.
[DP15]
Andrea Dal Pozzolo. Adaptive machine learning for credit card fraud detection. Université libre de Bruxelles, 2015.
[DPBC+17]
Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi. Credit card fraud
detection: a realistic modeling and a novel learning strategy. IEEE transactions on neural networks and learning systems,
29(8):3784–3797, 2017.
[DPCLB+14]
Andrea Dal Pozzolo, Olivier Caelen, Yann-Ael Le Borgne, Serge Waterschoot, and Gianluca Bontempi. Learned
lessons in credit card fraud detection from a practitioner perspective. Expert systems with applications, 41(10):4915–
4928, 2014.
[DG06]
Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd
international conference on Machine learning, 233–240. 2006.
[Elk01]
Charles Elkan. The foundations of cost-sensitive learning. In International joint conference on artificial intelligence,
volume 17, 973–978. Lawrence Erlbaum Associates Ltd, 2001.
[FZ11]
Guangzhe Fan and Mu Zhu. Detection of rare items with target. Statistics and Its Interface, 4(1):11–17, 2011.
[Faw04]
Tom Fawcett. Roc graphs: notes and practical considerations for researchers. Machine learning, 31(1):1–38, 2004.
[Faw06]
Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861–874, 2006.
[FernandezGarciaG+18]
Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C Prati, Bartosz Krawczyk, and Francisco Herrera. Learning
from imbalanced data sets. Springer, 2018.
[FK15]
Peter Flach and Meelis Kull. Precision-recall-gain curves: pr analysis done right. In Advances in neural information
processing systems, 838–846. 2015.
[FHT01]
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Volume 1. Springer series in
statistics New York, 2001.
[GvZliobaiteB+14]
João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift
adaptation. ACM computing surveys (CSUR), 46(4):1–37, 2014.
[Ins18]
Statistic Brain Research Institute. Credit card fraud statistics. April 2018. [Online; Last consulted 30-March-2021].
URL: https://www.statisticbrain.com/credit-card-fraud-statistics/.
[Kag16]
Kaggle. Credit card fraud detection dataset. November 2016. [Online; Last consulted 09-March-2021]. URL:
https://www.kaggle.com/mlg-ulb/creditcardfraud.
[Kri10]
M. Krivko. A hybrid model for plastic card fraud detection systems. Expert Systems with Applications, 37(8):6070–6076,
2010. URL: http://www.sciencedirect.com/science/article/pii/S0957417410001582,
doi:https://doi.org/10.1016/j.eswa.2010.02.119.
[LLBHG+19]
Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He-Guelton, Frédéric Oblé, and Gianluca Bontempi. Deep-learning
domain adaptation techniques for credit cards fraud detection. In INNS Big Data and Deep Learning conference, 78–88.
Springer, 2019.
[LemaitreNA17]
Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: a python toolbox to tackle the
curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017. URL:
http://jmlr.org/papers/v18/16-365.html.
[LJ20]
Yvan Lucas and Johannes Jurgovsky. Credit card fraud detection using machine learning: a survey. arXiv preprint
arXiv:2010.06479, 2020.
[McK17]
Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython - 2nd Edition. O'Reilly Media,
Inc., 2017.
[MekterovicBrkicBaranovic18]
Igor Mekterović, Ljiljana Brkić, and Mirta Baranović. A systematic review of data mining approaches to credit card
fraud detection. WSEAS Transactions on Business and Economics, 15:437–444, 2018.
[Mus19]
John Muschelli. Roc and auc with a binary predictor: a potentially misleading metric. Journal of Classification, pages 1–13,
2019.
[MullerG16]
Andreas C Müller and Sarah Guido. Introduction to machine learning with Python: a guide for data scientists. O'Reilly
Media, Inc., 2016.
[PL18]
Vipul Patil and Umesh Kumar Lilhore. A survey on different data mining & machine learning methods for credit card
fraud detection. International Journal of Scientific Research in Computer Science, Engineering and Information Technology,
3(5):320–325, 2018.
[PC18]
Rimpal R Popat and Jayesh Chaudhary. A survey on credit card fraud detection using machine learning. In 2018 2nd
International Conference on Trends in Electronics and Informatics (ICOEI), 1120–1125. IEEE, 2018.
[PP19]
C Victoria Priscilla and D Padma Prabha. Credit card fraud detection: a systematic review. In International Conference
on Information, Communication and Computing Technology, 290–303. Springer, 2019.
[rep19]
Nilson report. Nilson report issue 1164. November 2019. [Online; Last consulted 09-October-2020]. URL:
https://nilsonreport.com/upload/content_promo/The_Nilson_Report_Issue_1164.pdf.
[SSB18]
Imane Sadgali, Nawal Sael, and Faouzia Benabbou. Detection of credit card fraud: state of art. International Journal of
computer science and network security, 18(11):76–83, 2018.
[SR15]
Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating
binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015.
[SKK18]
Janvier Omar Sinayobye, Fred Kiwanuka, and Swaib Kaawaase Kyanda. A state-of-the-art review of machine learning
techniques for fraud detection research. In 2018 IEEE/ACM Symposium on Software Engineering in Africa (SEiA), 11–19.
IEEE, 2018.
[Tha20]
Alaa Tharwat. Classification assessment methods. Applied Computing and Informatics, 2020.
[VVBC+15]
Véronique Van Vlasselaer, Cristián Bravo, Olivier Caelen, Tina Eliassi-Rad, Leman Akoglu, Monique Snoeck, and Bart
Baesens. Apate: a novel approach for automated credit card transaction fraud detection using network-based
extensions. Decision Support Systems, 75:38–48, 2015.
[WHJ+09]
Christopher Whitrow, David J Hand, Piotr Juszczak, David Weston, and Niall M Adams. Transaction aggregation as a
strategy for credit card fraud detection. Data mining and knowledge discovery, 18(1):30–55, 2009.
[YAG19]
Niloofar Yousefi, Marie Alaghband, and Ivan Garibay. A comprehensive survey on machine learning techniques and
user authentication approaches for credit card fraud detection. arXiv preprint arXiv:1912.02629, 2019.
[ZAM+16]
Zahra Zojaji, Reza Ebrahimi Atani, Amir Hassan Monadjemi, and others. A survey of credit card fraud detection
techniques: data and technique oriented perspective. arXiv preprint arXiv:1611.06439, 2016.