End of Year Project Report
Ref: PFA2-2023-09
By
Amira BOUZIDI
Mohamed Iheb BEN ROMDHANE
5, Avenue Taha Hussein – Tunis, B.P. 56, Bab Menara 1008
Tel.: 71 496 066 – Fax: 71 391 166
Dedications
No dedication can express our respect and our consideration for the sacrifices that our
families and our supervisor have made for our education and our well-being in the best
conditions. We thank you for all the support you give us. May this work be the fulfilment of
your well-formulated wishes.
Acknowledgements
At the end of this work, we would like to thank our parents, who encouraged and
helped us to reach this stage of our education. We would also like to thank everyone
who contributed to the completion of this work.
Our thanks go to our supervisor, Mrs. Ines ELOUEDI, who supported us throughout
this end-of-year project at the National School of Engineers of Tunis (ENSIT)
and guided us in carrying it out. We also express our gratitude for her
continuous encouragement, her availability, and her openness to our
suggestions, while keeping a critical eye on our approach. We would like to
thank all those who helped and assisted us during our studies, and more
particularly the members of the jury who agreed to judge our work.
Finally, our thanks go to our school, which gave us the opportunity to acquire
professional training.
Table of Contents
General Introduction........................................................................................VI
2.1 Identification of actors............................................................................27
2.2 Use case diagram....................................................................................27
2.3 Text description......................................................................................28
3. Realization and implementation................................................................29
3.1 Technical and technological specification..............................................29
3.2 Overview of the app...............................................................................31
Conclusion.......................................................................................................33
General conclusion...........................................................................................34
Table of figures
List of tables
General Introduction
Chapter 1. General Framework
Introduction
Studying a project is a strategic step that gives us a clear vision of it and helps
to organize its smooth running.
In this first chapter, we present the context of the project as well as a study of the existing
situation followed by the proposed solution.
Banks also face legal issues when credit card holders default on their payments. The bank
may have to take legal action against the borrower to recover the outstanding amount. This
can be a time-consuming and costly process.
To date, banks have no definitive solution in place for detecting this kind of
problem.
3. Proposed solution
Our mission is therefore to develop a web application, based on a machine learning
model, that detects suspicious cases from consumer data.
For that, we utilized data from Kaggle, a community-driven platform for sharing data and
code, to support our analysis in this report.
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners.
Kaggle allows users to find and publish data sets, explore and build models in a web-based
data-science environment, work with other data scientists and machine-learning engineers,
and enter competitions to solve data science challenges [1].
Specifically, we used the dataset Default of Credit Card Clients Dataset by the organization
UCI MACHINE LEARNING. We acknowledge the contributions of the data creators and
thank them for making the data publicly available.
Conclusion
Throughout this chapter, we have presented the general context of our project as well as
the study of the existing situation and the proposed solution. The following chapter will be
devoted to the phase of collection, study, manipulation and cleaning of data.
Chapter 2. Data visualization and pre-processing
Introduction
In this chapter, we present the study developed for the analysis of the data. The
attributes are presented by category, so that attributes in the same category convey
the same or related information. The goal of our analysis is not only to fully
understand each attribute, but also to discover the cross effects between them. Since
the exploratory analysis of the data produced a large number of graphs and statistical
tables, we decided to present only the most interesting findings in this report.
1. Dataset Information
This dataset contains information on default payments, demographic factors, credit data,
history of payment, and bill statements of credit card clients in Taiwan from April 2005 to
September 2005.
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others,
5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one
month, 2=payment delay for two months, … 8=payment delay for eight months,
9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment (1=yes, 0=no)
Figure 3: Data table
This figure shows the rows of our raw data before any processing or cleaning.
2. Data cleaning
Feature engineering is an important step in data analysis because it allows you to perfect the
variables to achieve the desired result, namely a good prediction of the classes. We tried many
treatments on the variables in search of the best result. We then mainly tested two approaches.
The purpose of this approach is to identify and remove rows with illogical values. A row is
classified as having illogical values if its "PAY_X" values (columns 6-11) meet one of the
following criteria:
- there are two successive positive values whose difference is greater than 1 (e.g., a 2-month
delay in April followed by a delay of 4 or more months in May);
- there is a negative value ("paid in full / no consumption") followed by a value greater
than 1 (as the possible choices after it are -2, -1, 0, or 1);
- there is a value "X" followed by a succession of zeros and then a value "Y" greater
than X+1.
This method helped us to remove 3407 illogical rows.
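As an illustration, this filter can be written with pandas roughly as follows; the helper name and the chronological reading order of the PAY_* columns are our assumptions, and the three rules mirror the criteria listed above rather than the exact code we ran.

PAY_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]  # September back to April

def is_illogical(row):
    # Read the repayment statuses in chronological order (April -> September).
    seq = [row[c] for c in reversed(PAY_COLS)]
    for i in range(len(seq) - 1):
        prev, nxt = seq[i], seq[i + 1]
        # Rule 1: two successive positive delays that jump by more than one month.
        if prev > 0 and nxt > 0 and nxt - prev > 1:
            return True
        # Rule 2: a negative status (paid duly / no consumption) followed by a delay greater than 1.
        if prev < 0 and nxt > 1:
            return True
    # Rule 3: a value X, then a run of zeros, then a value Y greater than X + 1.
    for i in range(len(seq)):
        j = i + 1
        while j < len(seq) and seq[j] == 0:
            j += 1
        if j < len(seq) and j > i + 1 and seq[j] > seq[i] + 1:
            return True
    return False

# df is the raw DataFrame loaded from the Kaggle CSV:
# df = df[~df.apply(is_illogical, axis=1)]   # keeps only the logically consistent rows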
After applying these changes to our training and test data, we move on to the next step,
which mainly consists of cleaning our dataset and gathering the variables to make handling
easier. To do so, we drop the ID column before checking for duplicate rows.
Afterwards, we rechecked for redundant data and found none. After cleaning the dataset
and removing duplicate rows, we were left with 27133 rows and 24 columns.
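For reference, this cleaning step boils down to two pandas calls; the sketch below assumes df is the DataFrame obtained after removing the illogical rows.

df = df.drop(columns=["ID"])   # the ID column is unique per row and would hide duplicates
df = df.drop_duplicates()      # keep a single copy of each remaining row
print(df.shape)                # in our case: (27133, 24)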
Before moving to visualization, we first select the features that are most correlated with
the target variable. From the data provided, we see that we want to predict whether a person
will default on their payment next month or not. This prediction depends mostly on previous
payment history, credit limit, age, education and marriage. Let us plot these first.
Our target variable is "Target": default payment is marked 1, and 0 otherwise.
We therefore have two classes for Target: a positive class, which corresponds to default
payment, and a negative one for non-default payment.
The histogram presented in the figure above shows that there is an imbalance in the number of
observations between the two classes:
the dataset consists of 77% of clients who are not expected to default on their payment,
whereas 23% are expected to default. This poses a problem in the learning phase, because the
model will not have enough examples of the positive class to learn from.
We addressed this problem by using stratified K-fold cross-validation while running
the model.
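The imbalance itself can be checked directly on the target column; a one-line sketch, assuming the renamed target column is "Target":

print(df["Target"].value_counts(normalize=True))   # roughly 0.77 for class 0 and 0.23 for class 1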
3.2 Distribution of clients by sex
It can be seen that females have a very high tendency to default on payment compared to males.
Hence, we can keep the SEX column of clients to help predict the probability of default.
The figure below shows that customers between the ages of 24 and 36 have the highest
tendency to default on payment.
Figure 8: Distribution of clients by age
This plot allows us to identify the age group with the highest probability of payment
default.
Figure 9: The correlation map
A correlation map is a powerful visualization tool used to display the strength and direction of
the relationships between the different variables of our dataset. It is particularly useful for
identifying patterns and relationships between data points. As shown, the map allowed us to see
the relations among the monthly bill amounts on the one hand, and among the monthly payment
amounts on the other. This gave us the vision needed to sum the bill amounts and the payment
amounts so that our model can process the data more easily.
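A sketch of how such a map can be drawn, and of the aggregated BillSum and PaySum columns it led us to create, is given below; seaborn is assumed for the heatmap, and the figure size is arbitrary.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the columns (all remaining columns are numeric).
corr = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation map")
plt.show()

# Aggregated features suggested by the correlated groups: the sum of the six
# monthly bill amounts and the sum of the six monthly payment amounts.
df["BillSum"] = df[[f"BILL_AMT{i}" for i in range(1, 7)]].sum(axis=1)
df["PaySum"] = df[[f"PAY_AMT{i}" for i in range(1, 7)]].sum(axis=1)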
4. Data Pre-processing and feature engineering
Since the range of values of raw data varies widely, in some machine learning algorithms,
objective functions will not work properly without normalization.
Feature scaling is a method used to normalize the range of independent variables or features
of data. In data processing, it is also known as data normalization and is generally performed
during the data pre-processing step.
We processed our data and combined different columns into one, as described in the previous
section. We then used StandardScaler so as not to distort the results of these combinations.
StandardScaler standardizes each feature to zero mean and unit variance, which removes the gap
between the scales of our columns and lets our models process the dataset more effectively.
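A minimal sketch of this scaling step, assuming the predictors and the target are split before scaling:

import pandas as pd
from sklearn.preprocessing import StandardScaler

target = df["Target"]                     # kept aside, stays binary
features = df.drop(columns=["Target"])
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
# each column of X_scaled now has zero mean and unit variance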
Moreover, you can see here our data after applying the changes suited to our models.
The target column is not affected by the changes, because we are dealing with binary results
and we need it to remain binary in order to apply our models correctly.
Conclusion
Throughout this chapter, we have presented the steps that lead to having data cleaned
and ready to use within a machine learning model. The next chapter will be devoted to the
model selection and testing phase.
Chapter 3. Machine Learning model and tests
Introduction
This chapter is devoted first to the selection of a model compatible with our data: we begin
by studying existing machine learning algorithms and models, then choose the most suitable
one and determine the appropriate hyperparameters, and finally test the chosen model.
1. Tested models
We studied several machine learning methods for the classification of our data.
In this section, we present the principle of each tested method.
Before training the models, we split the dataset into two parts: a training set and a testing
set. We then used cross-validation to split the training set into 5 folds, training each time
on 4 of them and testing on the remaining one. This helps avoid overfitting; we record the
score of each fold and take the average.
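The split and the cross-validation loop can be sketched as follows; X_scaled and target come from the pre-processing chapter, and logistic regression is used here only as a placeholder estimator.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, target, test_size=0.2, stratify=target, random_state=42)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(scores.mean())   # average accuracy over the five folds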
After testing the logistic regression model, we obtained a moderately good result that can
still be improved: an accuracy of 79.9% and an AUC score of 0.684.
We chose to work with the accuracy metric because we are dealing with a binary classification
problem and this metric suits us best for evaluating each model.
The XGBoost model offers a built-in function, xgb.cv, that performs the cross-validation
for us automatically, without us having to code it.
After testing the XGBoost model, we obtained a moderately good result that can still be
improved: an accuracy of 80.34% and a precision of 0.559.
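The helper can be called roughly as follows; the boosting parameters shown are illustrative assumptions, not our exact configuration.

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {"objective": "binary:logistic", "eval_metric": "auc", "max_depth": 4, "eta": 0.1}
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    stratified=True, early_stopping_rounds=20, seed=42)
print(cv_results.tail(1))   # mean train/test AUC of the last kept boosting round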
The XGBoost classifier was used to predict the outcome variable, and it also gives us the
ability to inspect the feature importances. The top three most important features were BillSum,
Pay and age, with importance values of 564.0, 354.0 and 303.0, respectively. These features
were found to be highly correlated with the outcome variable and had a significant impact on
the accuracy of the model.
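As an indication, the importance scores of a fitted XGBoost classifier can be read as in the sketch below; the hyperparameters are assumptions and the column names are those of our processed DataFrame.

from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="auc")
clf.fit(X_train, y_train)
importance = clf.get_booster().get_score(importance_type="weight")
print(sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:3])  # top three features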
1.3 Feedforward neural networks (FFNN)
Our neural network is composed of 4 hidden layers and one output layer. Their activation
function is ReLU, because it helps reduce the time the neural network takes to train while
keeping approximately the same precision. We also used BatchNormalization and Dropout layers
to help minimize the risk of overfitting and, as always, we performed cross-validation for
the same purpose.
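The architecture described above could be written with Keras along the following lines; the layer widths and the dropout rate are assumptions, since the report does not fix them.

from tensorflow import keras
from tensorflow.keras import layers

def build_ffnn(n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # binary output: default / no default
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_ffnn(X_train.shape[1])
# model.fit(X_train, y_train, epochs=30, batch_size=256, validation_split=0.2)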
After testing the FFNN model, we obtained a moderately good result that can still be
improved: an accuracy of 80.17% and a precision of 0.557.
The FFNN algorithm was used to predict the outcome variable, and we were also able to compute
the feature importance of the different columns of our processed dataset. The most important
features were PaySum and PAY_AMT, with importance values of 0.30 and 0.25,
respectively. These features were found to be highly correlated with the outcome variable and
had a significant impact on the accuracy of the model.
Thanks to the research we carried out during this second part, we selected the model
that is most compatible with our data and most commonly used for problems of this type.
A random forest is a meta estimator that fits a number of decision tree classifiers on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting.[6]
Random forest is a supervised learning algorithm. The "forest" it builds is a set of decision
trees, usually trained with the "bagging" method. The general idea of the bagging method is
that a combination of learning models increases the overall result.
One of the great advantages of random forest is that it can be used for both classification and
regression problems, which make up the majority of machine learning systems today. [7]
Scikit-learn lets us evaluate the RandomForestClassifier with either standard or stratified
cross-validation; we chose the stratified version because the proportions of the "1" and "0"
labels are not equal.
After testing the RandomForestClassifier model, we obtained our best result, with an
accuracy of 80.54% and a precision of 0.573.
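A sketch of this stratified evaluation of the random forest, with illustrative hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rf = RandomForestClassifier(n_estimators=200, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
acc = cross_val_score(rf, X_train, y_train, cv=skf, scoring="accuracy")
print(acc.mean())   # average accuracy over the stratified folds

# Fitting on the training split then exposes the importances plotted in the figure below.
rf.fit(X_train, y_train)
print(sorted(zip(X_train.columns, rf.feature_importances_),
             key=lambda kv: kv[1], reverse=True)[:5])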
Figure 21: Feature importances of the RandomForestClassifier
Conclusion
In this chapter, we selected the appropriate model for our data and tested its performance.
In the next chapter, we will discuss deploying the model in a web application.
Chapter 4. Model deployment
Introduction
In this chapter, we present the adopted architecture as well as the hardware and software
environment in which our application was developed, and we end by introducing some of its
interfaces.
1. Requirement specification
The requirements specification is an essential phase in project planning, since it makes it
possible to determine and define the customer's requirements.
2. Conception
Actor: Administrator
Roles:
Authenticate
Consult the dashboard
Consult the list of clients
Consult suspicious accounts
Eliminate suspicious accounts
Consult forum discussions
Figure 22: Use case Diagram
3. Realization and implementation
In this part, we present the adopted architecture as well as the hardware and software
environment in which our application was developed, before introducing some of its
interfaces.
3.1.2 Hardware environment
This application was developed on an Asus machine with the following characteristics:
Processor: Intel(R) Core(TM) i7
RAM: 16 GB
Operating system: Microsoft Windows 11
Python: Python is a high-level, general-purpose programming language. Its design philosophy
emphasizes code readability with the use of significant indentation via the off-side rule. [11]
Streamlit: Streamlit is an open-source Python library that makes it easy to create and share
beautiful, custom web apps for machine learning and data science. [12]
Table 4: Software environment
The administrator can authenticate by entering his username and his password correctly.
The figure below shows the login page of our application.
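The authentication logic behind this page can be sketched with Streamlit as follows; the hardcoded credentials are placeholders for illustration, not the application's real check.

import streamlit as st

st.title("Payment default detection - Login")
username = st.text_input("Username")
password = st.text_input("Password", type="password")
if st.button("Log in"):
    if username == "admin" and password == "admin":   # placeholder credentials
        st.session_state["authenticated"] = True
        st.success("Logged in, redirecting to the dashboard...")
    else:
        st.error("Incorrect username or password")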
3.2.2 Dashboard page
After correctly entering the login data, the admin is redirected to the dashboard page,
where he finds the payment default prediction dashboard and the classification of
customers by sex or age.
3.2.3 List of clients page
The admin can be redirected to the list of clients page, where he can click on the "consult
suspicious accounts" button to see the list of clients who are involved in default-payment
activity.
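Behind this button, the deployed model scores each client; the sketch below illustrates the idea, with the file names and the column handling as assumptions.

import joblib
import pandas as pd
import streamlit as st

model = joblib.load("random_forest_model.pkl")   # placeholder path to the saved model
clients = pd.read_csv("clients.csv")             # placeholder path to the client data

if st.button("Consult suspicious accounts"):
    features = clients.drop(columns=["Target"], errors="ignore")
    clients["suspicious"] = model.predict(features)
    st.dataframe(clients[clients["suspicious"] == 1])   # clients predicted to default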
The administrator can click on the "eliminate client" button, and the suspicious
account is then deleted from the list of clients.
The admin can also be redirected to the forum discussions page to check the list of clients'
comments.
Conclusion
In this chapter, we have described the hardware and software environment, the design as well
as the architecture adopted to carry out this project, then we have presented some interfaces
and their operation.
General conclusion
In conclusion, predicting credit card payment default is a complex task that requires
careful consideration of various factors.
This study provides valuable insights into the use of machine learning techniques for
predicting credit card payment default and highlights the importance of choosing the
appropriate algorithm based on the specific context and limitations of the data.
This report presents our work carried out as part of our end-of-year project in the second year
of computer engineering at the National School of Engineers of Tunis. During this project, we
created a web application based on a machine learning model.
To carry out our project, we had to understand the different payment default techniques as
well as how machine learning works, by attending several training courses in machine learning,
before being able to set up a system based on the RandomForestClassifier machine learning
algorithm. We all know that machine learning and artificial intelligence help us prevent and
protect against fraud, but we wonder whether this field will also revolutionize the handling
of security attacks within companies.
Bibliography
[1] https://en.wikipedia.org/wiki/Kaggle
[2] https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset
[3] https://www.ibm.com/topics/logistic-regression#:~:text=Resources-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables.
[4] https://xgboost.readthedocs.io/en/stable/
[5] https://en.wikipedia.org/wiki/Feedforward_neural_network
[6] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[8] https://en.wikipedia.org/wiki/Multitier_architecture#Three-tier_architecture
[9] https://en.wikipedia.org/wiki/Kaggle
[10] https://en.wikipedia.org/wiki/Visual_Studio_Code
[11] https://en.wikipedia.org/wiki/Python_(programming_language)
[12] https://docs.streamlit.io/
Abstract
The project aims to design a web application that detects credit card payment default activities
using a machine learning model.
Keywords: machine learning, payment default, artificial intelligence, algorithm, model,
development, web, credit card, database.