End of Year Project Report
Ref: PFA2-2023-09
By
Amira BOUZIDI
Mohamed Iheb BEN ROMDHANE
5, Avenue Taha Hussein – Tunis, B.P. 56, Bab Menara 1008
Tel.: 71 496 066 – Fax: 71 391 166
Dedications
No dedication can express our respect and our consideration for the sacrifices that our
families and our supervisor have made for our education and our well-being in the best
conditions. We thank you for all the support you give us. May this work be the fulfilment of
your well-formulated wishes.
Acknowledgements
At the end of this work, we would like to thank our parents, who encouraged and
helped us to reach this stage of our education. We would also like to thank everyone
who contributed to the completion of this work.
Our thanks go to our supervisor, Mrs. Ines ELOUEDI, who supported us throughout
this end-of-year project at the National School of Engineers of Tunis (ENSIT)
and guided us in carrying it out. We also express our gratitude for her
continuous encouragement, her availability, and her openness to our
suggestions, while keeping a critical eye on our approach. We would like to
thank all those who helped and assisted us during our studies, and more
particularly the members of the jury who agreed to judge our work.
Finally, our thanks go to our school, which gave us the opportunity to acquire
professional training.
Table of Contents
General Introduction........................................................................................VI
2.1 Identification of actors............................................................................27
2.2 Use case diagram....................................................................................27
2.3 Text description......................................................................................28
3. Realization and implementation................................................................29
3.1 Technical and technological specification..............................................29
3.2 Overview of the app...............................................................................31
Conclusion.......................................................................................................33
General conclusion...........................................................................................34
Table of figures
List of tables
General Introduction
Chapter 1. General Framework
Introduction
Studying a project is a strategic step that gives us a clear vision of it and helps
to organize its smooth running.
In this first chapter, we present the context of the project as well as a study of the existing
situation followed by the proposed solution.
Banks also face legal issues when credit card holders default on their payments. The bank
may have to take legal action against the borrower to recover the outstanding amount. This
can be a time-consuming and costly process.
To date, banks have no definitive solution in place for detecting this kind of
problem.
3. Proposed solution
Our mission is therefore to develop a web application, based on a machine learning
model, that detects suspicious cases from consumer data.
For that, we utilized data from Kaggle, a community-driven platform for sharing data and
code, to support our analysis in this report.
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine
learning practitioners.
Kaggle allows users to find and publish data sets, explore and build models in a web-based
data-science environment, work with other data scientists and machine-learning engineers,
and enter competitions to solve data science challenges [1].
Specifically, we used the dataset Default of Credit Card Clients Dataset by the organization
UCI MACHINE LEARNING. We acknowledge the contributions of the data creators and
thank them for making the data publicly available.
Conclusion
Throughout this chapter, we have presented the general context of our project as well as
the study of the existing situation and the proposed solution. The following chapter will be
devoted to the phase of collection, study, manipulation and cleaning of data.
Chapter 2. Data visualization and pre-processing
Introduction
In this chapter, we present the study developed for the analysis of the data. The
attributes are presented by category, so that attributes in the same category convey
the same or related information. The goal of our analysis is not only to fully
understand each attribute, but also to discover the cross effects between them. Since
the exploratory analysis of the data produced a large number of graphs and statistical
tables, we decided to present only the most interesting findings in this report.
1. Dataset Information
This dataset contains information on default payments, demographic factors, credit data,
history of payment, and bill statements of credit card clients in Taiwan from April 2005 to
September 2005.
SEX: Gender (1=male, 2=female)
EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others,
5=unknown, 6=unknown)
MARRIAGE: Marital status (1=married, 2=single, 3=others)
AGE: Age in years
PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one
month, 2=payment delay for two months, … 8=payment delay for eight months,
9=payment delay for nine months and above)
PAY_2: Repayment status in August, 2005 (scale same as above)
PAY_3: Repayment status in July, 2005 (scale same as above)
PAY_4: Repayment status in June, 2005 (scale same as above)
PAY_5: Repayment status in May, 2005 (scale same as above)
PAY_6: Repayment status in April, 2005 (scale same as above)
BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
default.payment.next.month: Default payment (1=yes, 0=no)
Figure 3: Data table
This figure shows the rows of our raw data before any processing or cleaning.
2. Data cleaning
Feature engineering is an important step in data analysis because it allows you to perfect the
variables to achieve the desired result, namely a good prediction of the classes. We tried many
treatments on the variables in search of the best result. We then mainly tested two approaches.
The purpose of this approach is to identify and remove rows with illogical values. A row is
classified as having illogical values if its "PAY_X" values (columns 6-11) meet one of the
following criteria:
- there are two successive positive values whose difference is greater than 1 (e.g., a 2-month
delay in April followed by a delay of 4 or more months in May);
- there is a negative value ("paid in full / no consumption") followed by a value greater
than 1 (as the possible choices after it are -2, -1, 0, or 1);
- there is a value "X" followed by a succession of zeros and then a value "Y" greater
than X+1.
This method helped us to remove 3407 illogical rows.
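As an illustration, this filter can be written with pandas roughly as follows; the helper name and the chronological reading order of the PAY_* columns are our assumptions, and the three rules mirror the criteria listed above rather than the exact code we ran.

PAY_COLS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]  # September back to April

def is_illogical(row):
    # Read the repayment statuses in chronological order (April -> September).
    seq = [row[c] for c in reversed(PAY_COLS)]
    for i in range(len(seq) - 1):
        prev, nxt = seq[i], seq[i + 1]
        # Rule 1: two successive positive delays that jump by more than one month.
        if prev > 0 and nxt > 0 and nxt - prev > 1:
            return True
        # Rule 2: a negative status (paid duly / no consumption) followed by a delay greater than 1.
        if prev < 0 and nxt > 1:
            return True
    # Rule 3: a value X, then a run of zeros, then a value Y greater than X + 1.
    for i in range(len(seq)):
        j = i + 1
        while j < len(seq) and seq[j] == 0:
            j += 1
        if j < len(seq) and j > i + 1 and seq[j] > seq[i] + 1:
            return True
    return False

# df is the raw DataFrame loaded from the Kaggle CSV:
# df = df[~df.apply(is_illogical, axis=1)]   # keeps only the logically consistent rows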
After applying these changes to our training and test data, we move on to the next step,
which mainly consists of cleaning our dataset and gathering the variables to make handling
easier. To do so, we drop the ID column before checking for duplicate rows.
Afterwards, we rechecked for redundant data and found none. After cleaning the dataset
and removing duplicate rows, we were left with 27133 rows and 24 columns.
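For reference, this cleaning step boils down to two pandas calls; the sketch below assumes df is the DataFrame obtained after removing the illogical rows.

df = df.drop(columns=["ID"])   # the ID column is unique per row and would hide duplicates
df = df.drop_duplicates()      # keep a single copy of each remaining row
print(df.shape)                # in our case: (27133, 24)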
Before moving to visualization, we first select the features that are most correlated with
the target variable. From the data provided, we see that we want to predict whether a person
will default on their payment next month or not. This prediction depends mostly on previous
payment history, credit limit, age, education and marriage. Let us plot these first.
Our target variable is "Target": default payment is marked 1, and 0 otherwise.
We therefore have two classes for Target: a positive class, which corresponds to default
payment, and a negative one for non-default payment.
The histogram presented in the figure above shows that there is an imbalance in the number of
observations between the two classes:
the dataset consists of 77% of clients who are not expected to default on their payment,
whereas 23% are expected to default. This poses a problem in the learning phase, because the
model will not have enough examples of the positive class to learn from.
We addressed this problem by using stratified K-fold cross-validation while running
the model.
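The imbalance itself can be checked directly on the target column; a one-line sketch, assuming the renamed target column is "Target":

print(df["Target"].value_counts(normalize=True))   # roughly 0.77 for class 0 and 0.23 for class 1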
3.2 Distribution of clients by sex
It can be seen that females have a very high tendency to default on payment compared to males.
Hence, we can keep the SEX column of clients to help predict the probability of default.
The figure below shows that customers between the ages of 24 and 36 have the highest
tendency to default on payment.
Figure 8: Distribution of clients by age
This plot allows us to identify the age group with the highest probability of payment
default.
Figure 9: The correlation map
A correlation map is a powerful visualization tool used to display the strength and direction of
the relationships between the different variables of our dataset. It is particularly useful for
identifying patterns and relationships between data points. As shown, the map allowed us to see
the relations among the monthly bill amounts on the one hand, and among the monthly payment
amounts on the other. This gave us the vision needed to sum the bill amounts and the payment
amounts so that our model can process the data more easily.
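A sketch of how such a map can be drawn, and of the aggregated BillSum and PaySum columns it led us to create, is given below; seaborn is assumed for the heatmap, and the figure size is arbitrary.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the columns (all remaining columns are numeric).
corr = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation map")
plt.show()

# Aggregated features suggested by the correlated groups: the sum of the six
# monthly bill amounts and the sum of the six monthly payment amounts.
df["BillSum"] = df[[f"BILL_AMT{i}" for i in range(1, 7)]].sum(axis=1)
df["PaySum"] = df[[f"PAY_AMT{i}" for i in range(1, 7)]].sum(axis=1)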
4. Data Pre-processing and feature engineering
Since the range of values of raw data varies widely, in some machine learning algorithms,
objective functions will not work properly without normalization.
Feature scaling is a method used to normalize the range of independent variables or features
of data. In data processing, it is also known as data normalization and is generally performed
during the data pre-processing step.
We processed our data and combined different columns into one, as described in the previous
section. We then used StandardScaler so as not to distort the results of these combinations.
StandardScaler standardizes each feature to zero mean and unit variance, which removes the gap
between the scales of our columns and lets our models process the dataset more effectively.
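A minimal sketch of this scaling step, assuming the predictors and the target are split before scaling:

import pandas as pd
from sklearn.preprocessing import StandardScaler

target = df["Target"]                     # kept aside, stays binary
features = df.drop(columns=["Target"])
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
# each column of X_scaled now has zero mean and unit variance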
Moreover, you can see here our data after applying the changes suited to our models.
The target column is not affected by the changes, because we are dealing with binary results
and we need it to remain binary in order to apply our models correctly.
Conclusion
Throughout this chapter, we have presented the steps that lead to having data cleaned
and ready to use within a machine learning model. The next chapter will be devoted to the
model selection and testing phase.
Chapter 3. Machine Learning model and tests
Introduction
This chapter is devoted first to the selection of a model compatible with our data: we begin
by studying existing machine learning algorithms and models, then choose the most suitable
one and determine the appropriate hyperparameters, and finally test the chosen model.
1. Tested models
We studied several machine learning methods for the classification of our data.
In this section, we present the principle of each tested method.
Before training the models, we split the dataset into two parts: a training set and a testing
set. We then used cross-validation to split the training set into 5 folds, training each time
on 4 of them and testing on the remaining one. This helps avoid overfitting; we record the
score of each fold and take the average.
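The split and the cross-validation loop can be sketched as follows; X_scaled and target come from the pre-processing chapter, and logistic regression is used here only as a placeholder estimator.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, target, test_size=0.2, stratify=target, random_state=42)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(scores.mean())   # average accuracy over the five folds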
After testing the logistic regression model, we obtained a moderately good result that can
still be improved: an accuracy of 79.9% and an AUC score of 0.684.
We chose to work with the accuracy metric because we are dealing with a binary classification
problem and this metric suits us best for evaluating each model.
The XGBoost model offers a built-in function, xgb.cv, that performs the cross-validation
for us automatically, without us having to code it.
After testing the XGBoost model, we obtained a moderately good result that can still be
improved: an accuracy of 80.34% and a precision of 0.559.
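The helper can be called roughly as follows; the boosting parameters shown are illustrative assumptions, not our exact configuration.

import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {"objective": "binary:logistic", "eval_metric": "auc", "max_depth": 4, "eta": 0.1}
cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                    stratified=True, early_stopping_rounds=20, seed=42)
print(cv_results.tail(1))   # mean train/test AUC of the last kept boosting round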
The XGBoost classifier was used to predict the outcome variable, and it also gives us the
ability to inspect the feature importances. The top three most important features were BillSum,
Pay and age, with importance values of 564.0, 354.0 and 303.0, respectively. These features
were found to be highly correlated with the outcome variable and had a significant impact on
the accuracy of the model.
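As an indication, the importance scores of a fitted XGBoost classifier can be read as in the sketch below; the hyperparameters are assumptions and the column names are those of our processed DataFrame.

from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="auc")
clf.fit(X_train, y_train)
importance = clf.get_booster().get_score(importance_type="weight")
print(sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:3])  # top three features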
1.3 Feedforward neural networks (FFNN)
Our neural network is composed of 4 hidden layers and one output layer. Their activation
function is ReLU, because it helps reduce the time the neural network takes to train while
keeping approximately the same precision. We also used BatchNormalization and Dropout layers
to help minimize the risk of overfitting and, as always, we performed cross-validation for
the same purpose.
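The architecture described above could be written with Keras along the following lines; the layer widths and the dropout rate are assumptions, since the report does not fix them.

from tensorflow import keras
from tensorflow.keras import layers

def build_ffnn(n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # binary output: default / no default
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_ffnn(X_train.shape[1])
# model.fit(X_train, y_train, epochs=30, batch_size=256, validation_split=0.2)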
After testing the FFNN model, we obtained a moderately good result that can still be
improved: an accuracy of 80.17% and a precision of 0.557.
The FFNN algorithm was used to predict the outcome variable, and we were also able to compute
the feature importance of the different columns of our processed dataset. The most important
features were PaySum and PAY_AMT, with importance values of 0.30 and 0.25,
respectively. These features were found to be highly correlated with the outcome variable and
had a significant impact on the accuracy of the model.
Thanks to the research we carried out during this second part, we selected the model
that is most compatible with our data and most commonly used for problems of this type.
A random forest is a meta estimator that fits a number of decision tree classifiers on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting.[6]
Random forest is a supervised learning algorithm. The "forest" it builds is a set of decision
trees, usually trained with the "bagging" method. The general idea of the bagging method is
that a combination of learning models increases the overall result.
One of the great advantages of random forest is that it can be used for both classification and
regression problems, which make up the majority of machine learning systems today. [7]
Scikit-learn lets us evaluate the RandomForestClassifier with either standard or stratified
cross-validation; we chose the stratified version because the proportions of the "1" and "0"
labels are not equal.
After testing the RandomForestClassifier model, we obtained our best result, with an
accuracy of 80.54% and a precision of 0.573.
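A sketch of this stratified evaluation of the random forest, with illustrative hyperparameters:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rf = RandomForestClassifier(n_estimators=200, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
acc = cross_val_score(rf, X_train, y_train, cv=skf, scoring="accuracy")
print(acc.mean())   # average accuracy over the stratified folds

# Fitting on the training split then exposes the importances plotted in the figure below.
rf.fit(X_train, y_train)
print(sorted(zip(X_train.columns, rf.feature_importances_),
             key=lambda kv: kv[1], reverse=True)[:5])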
Figure 21: Feature importances of the RandomForestClassifier
Conclusion
In this chapter, we selected the appropriate model for our data and tested its performance.
In the next chapter, we will discuss deploying the model in a web application.
Chapter 4. Model deployment
Introduction
In this chapter, we present the adopted architecture as well as the hardware and software
environment in which our application was developed, and we end by introducing some of its
interfaces.
1. Requirement specification
The requirements specification is an essential phase in project planning, since it makes it
possible to determine and define the customer's requirements.
2. Conception
Actor: Administrator
Roles:
Authenticate
Consult the dashboard
Consult the list of clients
Consult suspicious accounts
Eliminate suspicious accounts
Consult forum discussions
Figure 22: Use case Diagram
3. Realization and implementation
In this part, we present the adopted architecture as well as the hardware and software
environment in which our application was developed, before introducing some of its
interfaces.
3.1.2 Hardware environment
This application was developed on an Asus machine with the following characteristics:
Processor: Intel(R) Core(TM) i7
RAM: 16 GB
Operating system: Microsoft Windows 11
Python: Python is a high-level, general-purpose programming language. Its design philosophy
emphasizes code readability with the use of significant indentation via the off-side rule. [11]
Streamlit: Streamlit is an open-source Python library that makes it easy to create and share
beautiful, custom web apps for machine learning and data science. [12]
Table 4: Software environment
The administrator can authenticate by entering his username and his password correctly.
The figure below shows the login page of our application.
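The authentication logic behind this page can be sketched with Streamlit as follows; the hardcoded credentials are placeholders for illustration, not the application's real check.

import streamlit as st

st.title("Payment default detection - Login")
username = st.text_input("Username")
password = st.text_input("Password", type="password")
if st.button("Log in"):
    if username == "admin" and password == "admin":   # placeholder credentials
        st.session_state["authenticated"] = True
        st.success("Logged in, redirecting to the dashboard...")
    else:
        st.error("Incorrect username or password")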
3.2.2 Dashboard page
After correctly entering the login data, the admin is redirected to the dashboard page,
where he finds the payment default prediction dashboard and the classification of
customers by sex or age.
3.2.3 List of clients page
The admin can be redirected to the list of clients page, where he can click on the "consult
suspicious accounts" button to see the list of clients who are involved in default-payment
activity.
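Behind this button, the deployed model scores each client; the sketch below illustrates the idea, with the file names and the column handling as assumptions.

import joblib
import pandas as pd
import streamlit as st

model = joblib.load("random_forest_model.pkl")   # placeholder path to the saved model
clients = pd.read_csv("clients.csv")             # placeholder path to the client data

if st.button("Consult suspicious accounts"):
    features = clients.drop(columns=["Target"], errors="ignore")
    clients["suspicious"] = model.predict(features)
    st.dataframe(clients[clients["suspicious"] == 1])   # clients predicted to default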
The administrator can click on the "eliminate client" button, and the suspicious
account is then deleted from the list of clients.
The admin can also be redirected to the forum discussions page to check the list of clients'
comments.
Conclusion
In this chapter, we have described the hardware and software environment, the design as well
as the architecture adopted to carry out this project, then we have presented some interfaces
and their operation.
General conclusion
In conclusion, predicting credit card payment default is a complex task that requires
careful consideration of various factors.
This study provides valuable insights into the use of machine learning techniques for
predicting credit card payment default and highlights the importance of choosing the
appropriate algorithm based on the specific context and limitations of the data.
This report presents our work carried out as part of our end-of-year project in the second year
of computer engineering at the National School of Engineers of Tunis. During this project, we
created a web application based on a machine learning model.
To carry out our project, we had to understand the different payment default techniques as
well as how machine learning works, by attending several training courses in machine learning,
before being able to set up a system based on the RandomForestClassifier machine learning
algorithm. We all know that machine learning and artificial intelligence help us prevent and
protect against fraud, but we wonder whether this field will also revolutionize the handling
of security attacks within companies.
Bibliography
[1] https://en.wikipedia.org/wiki/Kaggle
[2] https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset
[3] https://www.ibm.com/topics/logistic-regression#:~:text=Resources-,What%20is%20logistic%20regression%3F,given%20dataset%20of%20independent%20variables.
[4] https://xgboost.readthedocs.io/en/stable/
[5] https://en.wikipedia.org/wiki/Feedforward_neural_network
[6] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[8] https://en.wikipedia.org/wiki/Multitier_architecture#Three-tier_architecture
[9] https://en.wikipedia.org/wiki/Kaggle
[10] https://en.wikipedia.org/wiki/Visual_Studio_Code
[11] https://en.wikipedia.org/wiki/Python_(programming_language)
[12] https://docs.streamlit.io/
Abstract
The project aims to design a web application that detects credit card payment default activities
using a machine learning model.
Keywords: machine learning, payment default, artificial intelligence, algorithm, model,
development, web, credit card, database.