
An Automatic Credit Analysis Model

The document presents an automated credit analysis model utilizing the Recurrent CatBoost classifier to assess loan eligibility based on consumer credit history and transaction data. The model achieves a prediction accuracy of 88% and aims to streamline the loan approval process by leveraging machine learning techniques on a dataset of 55,596 instances. The research emphasizes the importance of identifying relevant variables for credit scoring and proposes a systematic approach to improve decision-making in lending practices.


An automated credit analysis model using recurrent CatBoost classifier

Ms. K. Sivasankari,
Assistant Professor
Computer Science and Engineering SRM Institute of Science and Technology,
Ramapuram Chennai, India
[email protected]

Darisi Venkata Dinesh Bharatwaja
Computer Science and Engineering SRM Institute of Science and Technology,
Ramapuram Chennai, India
[email protected]

Kaliki Yuvakrishna
Computer Science and Engineering SRM Institute of Science and Technology,
Ramapuram Chennai, India
[email protected]

Katragadda Sai Rohith
Computer Science and Engineering SRM Institute of Science and Technology,
Ramapuram Chennai, India
[email protected]

Abstract: The growth of the Internet of Things has brought various benefits to consumers. People have started using Android mobile devices to take out loans and carry out other banking activities, and transactions through mobile phones keep increasing. People who apply for loans need a significant credit history to be eligible. The credit score provides traceability of the consumer taking loans, and the banking sector treats it as a sensitive factor when initiating loans. The proposed system is developed with existing techniques in mind and provides an automated model that determines loan eligibility in a short span of time. Machine learning algorithms are highly helpful for creating, fitting, and correlating credit pattern analyses using the consumer's historical data. The proposed model uses the Recurrent CatBoost (RCB) algorithm and focuses on optimized weight balancing of the transaction history. The proposed approach uses the PPDai defaulters dataset collected from publicly available websites; the data is available in the UCI repository. Default data of 55596 instances with 24 features are considered for analysis. The proposed model achieves a loss rate of 0.156, and the RCB model achieves a prediction accuracy of 88%.

Keywords: Machine learning, Credit analysis, Data mining, Data analytics, Automated analysis.

I. INTRODUCTION

Automated credit scoring is the process of using computer algorithms to assess the creditworthiness of a borrower. This involves analyzing a range of factors that affect a borrower's ability to repay a loan, such as their credit history, income, employment status, and other financial obligations. Credit scoring models are typically designed to assign a numerical score to each borrower, based on their credit profile and other relevant factors. Lenders use these scores to determine the risk of lending to a particular borrower, and to make decisions about the terms of the loan, such as the interest rate and the amount of the loan. Automated credit scoring has become increasingly common in recent years, as lenders have sought to streamline the lending process and reduce the risk of lending. By using computer algorithms to evaluate borrowers, lenders can quickly and accurately assess creditworthiness, which can help them make more informed lending decisions [1].

Machine Learning (ML) has been increasingly used in credit analysis, also known as credit scoring. However, the vast majority of articles focus on ML techniques and do not delve into which variables are most relevant for distinguishing good and bad payers. The objective of this research is to identify published works that study the variables that define whether a customer defaults, as well as to identify what leads the consumer to take credit even though he or she does not have the resources to (re)pay. To achieve the objective of this study, a systematic literature review was carried out. The combination of automatic searches resulted in 36,639 articles, of which 17 were relevant. The studies found about the credit score present similar rating analysis methods, and only the variables used in the models changed [2].

The ensemble method incorporates several base classification algorithms, such as Decision Trees, Logistic Regression, Nearest Neighbour, and Support Vector Machine, to achieve better results. The objective of this paper is to predict the credit score with different classifier models and to evaluate the performance of each model using standard metrics. A comparative analysis is done to identify the best classifier for predicting the credit score. The evaluation metrics used are Recall, Precision, F-measure, and Accuracy. Error measures such as MAE and RMSE were also used to evaluate the model. This helps in identifying the more accurate classifier model. The dataset used for this analysis is the Australian credit dataset from the UCI Machine Learning repository. Experimental results show that the Random Forest and Extra Tree classifier models produce better accuracy among the ensemble classifiers and that the SVM model gives better accuracy among the base classifiers.

• The proposed system builds on existing techniques in practice and provides an automated model that determines loan eligibility in a brief period of time.
• Machine learning techniques are used to develop, fit, and correlate credit pattern analyses using the customer's historical data.
• The proposed model incorporates the Recurrent CatBoost (RCB) algorithm, with emphasis on efficient weight balancing of the transaction history.
• The proposed approach uses PPDai defaulter data collected from publicly accessible websites.
• The data is available in the UCI repository. Default data of 55596 instances with 24 features are considered for analysis.

The rest of the paper is organized as follows. Section II presents the background study. Section III covers system tool selection, model development, and configurations. Section IV describes the design methodology. Section V presents the results and discussion.

II. BACKGROUND STUDY

M. Al Aradi et al. (2020) developed a high-performance predictive method for loan approval using decision trees. Experiments were run with various kinds of tree methods, ranging from simple, easily understood decision trees up to more complex random forests. The performance of the simplified decision trees was insufficient, because the feature space is highly correlated and complex and most of the critical parameters that affect loan approval were not reflected, which produced an unsuitable, over-simplified tree. Boosting, in contrast, provided very high performance, relevance, and interpretability through the importance chart, with scoring accuracy on the testing dataset of 98.75%, specificity of 100%, minority class prediction accuracy of 92.85%, and classification efficiency of 97.0%. A boosting-based decision-tree predictive model was therefore proposed for decisions on the eligibility of loan applicants based on their characteristics [3].

M. S. Park et al. (2021) note that the amount of data is increasing rapidly, so the assumptions of existing economic analysis methods are no longer satisfied and building a new analysis model becomes complex. For bankruptcy prediction, demand for machine learning techniques has therefore increased because of their high performance. The approach is first applied to typical tree-based models, which can evaluate their own feature importance. This showed that the feature importance evaluated by LIME is a reasonable approximation of the feature importance evaluated by the tree-based models themselves. Additionally, the stability of the feature importance is studied using the model's predicted bankruptcy probability. This suggests that observations of vital features can be used as a basis for the fair treatment of loan eligibility requirements [4].

Ugo E. (2022): Machine learning algorithms are considered to be reforming processes in all fields, including real estate, security, bioinformatics, and the financial industry. Approving loans is one of the most difficult tasks faced in the banking industry. Python programming libraries in Kaggle's Jupyter Notebook cloud environment are used to process and evaluate the dataset. The results of that research showed high accuracy with the
Random Forest algorithm, which has the highest score of 95.55%, while Logistic Regression has the lowest score of 80%. The proposed method performs very well compared with two of the three loan prediction models found in the literature in terms of precision-recall and accuracy [5].

Miraz Al Mamun (2022): Machine Learning (ML) algorithms are helpful for extracting patterns from a common loan-approval dataset and predicting suitable loan applicants. The study uses previous customer data, including age, income type, loan annuity, last credit bureau report, type of organization they work for, and length of employment. To identify the most important features, that is, the elements with the most impact on the prediction output, ML methods such as Random Forest, XGBoost, AdaBoost, LightGBM, Decision Tree, and K-Nearest Neighbor were employed. These algorithms are compared and evaluated against one another using standard metrics. Among them, the highest accuracy of 92% is obtained by Logistic Regression [6].

Murthy et al. (2020): The profit gained from loans is the main source of income of bank assets. The main aim of banks is to invest their assets in safe customers. Banks now sanction a loan only after a long process of checking and evaluation. However, it is still not possible to know for certain whether the chosen customer is safe. Hence, the banking sector needs different methods for correctly selecting customers who repay loans on time. This method uses the random forest algorithm to classify the data. The random forest algorithm builds a model from the training dataset, and this model is applied to the test data to obtain the required result [7].

Various existing state-of-the-art approaches are compared to derive a strong solution to the challenges in credit analysis [8]-[12].

III. SYSTEM DESIGN

In the modern world, getting loans from banks is very common. The main business of a bank is lending money, and its main benefit from a loan is the interest on it. Because a bank's funds are limited, it can lend only to a limited number of people. The standard procedure entails determining who is eligible for the loan and who would be a better option for the bank. The prediction
of loan eligibility is regarded as a classification problem: it involves predicting whether or not a loan will be approved. In such situations, discrete values are predicted from a particular set of independent variables. The bank's profits and losses are mostly determined by loans, that is, by whether or not customers default on their loan payments. The bank may be able to reduce its Non-Performing Assets once the loan defaulters have been predicted, which is what makes this study significant. Because the Loan Prediction System can automatically measure the weight of each characteristic involved in loan processing, the same features are processed according to their associated weights on new test data. In most cases, a deadline is set for telling the applicant whether or not the loan will be approved.

The applicant's information is manually checked by bank employees before the loan is granted to the appropriate applicant. It takes a long time to look over all applicants' personal information. A bank's credit risk can be predicted using an artificial neural network model. The feed-forward back-propagation neural network makes a prediction regarding the credit default. Combining two or more classifiers creates a group (ensemble) model with better predictive potential. Such models employ the bagging and boosting techniques, as well as the random forest technique. Classifiers are tasked with enhancing performance on the data and maximizing efficiency. In addition to multi-class classification, various ensemble strategies for binary classification are demonstrated in this paper. COB is a new ensemble method that performs classification very well. However, classification noise and outlier data also compromise it. It is concluded that the ensemble-based algorithm improves results on the training data set.

Based on the applicant's data, the banking institution must automate the loan qualification process in real time. Information provided when filling out a request form, such as gender, marital status, income, credit history, education, and the number of dependents, is used. They developed a system that makes it easy to identify different types of applicants, determine which ones are eligible for a particular loan amount, and approach them specifically. This is regarded as a classification problem because every application must be classified prior to determining
whether the loan status is Yes or No. The system can quickly determine whether a loan application will be approved or rejected. To achieve the desired outcome, the proposed method makes use of a different algorithm. The Python programming language is used because it contains all of the necessary tools and libraries and is one of the most popular and widely used languages in AI and ML.

IV. METHODOLOGY

Fig 1. System architecture of proposed Credit analysis model (CAM)

A. Platform configurations

Python contains many libraries, like pandas for the filtering process and matplotlib for plotting the data, data visualization, and exploratory data analysis. Additionally, it uses sklearn (Scikit-learn), which includes many clustering, regression, and classification algorithms that are widely used in AI and machine learning. The model then applies these techniques to a pre-defined data set that has all the information about the customers. The algorithms are applied one after the other, in sequence. The data is then evaluated, classified, and fed into the model to train it. After each algorithm, the precision rate is reported.

B. Model analysis

The model is trained with many algorithms to get a precise result. The Recurrent CatBoost Classifier algorithm is used with a 70% training set and a 30% testing set, and it is found to provide good precision. After training and testing, the model predicts whether the current candidate is a good candidate for loan acceptance. The result shows that good detection of capable borrowers is beneficial to the organization:

• The time period for loan sanctioning will be minimized.
• The entire process will be
automated; hence human error will be avoided.
• Eligible applicants will be sanctioned loans without any delay.

C. Data collection

The dataset gathered for predicting loan default clients is divided into a training set and a testing set. Generally, an 80:20 proportion is used to separate the training set and the testing set. The data model produced with RCB is fitted on the training set and, depending on its accuracy, predictions are made on the test set. Some of the attributes in the dataset include Loan-id, Gender, Dependents, Education, self-employed, Applicant Income, Coapplicant Income, LoanAmount, Loan_Amount_term, Credit_history, etc.

D. Data pre-process

The collected data will sometimes have missing values, which may cause inconsistency. To obtain the best results, the data needs to be pre-processed, which improves the effectiveness of the algorithm. Outliers should be eliminated and the variables should be converted. To resolve these issues, the chart function is utilized.

E. Building a model

Once pre-processing is done, a detailed evaluation is carried out to analyse the data set and obtain a clear idea of the characteristics of the data. After Exploratory Data Analysis (EDA) is complete, the data is used to build supervised and unsupervised learning models. Initially, many hypotheses are made by looking at the data before the modelling stage is reached, and EDA is used to confirm and evaluate these hypotheses. In many cases, EDA is performed using univariate analysis, which gives insights into each field in the raw dataset, and bivariate analysis, which identifies the link between each variable in the dataset and the target variable.

F. Model training

The model is now trained on the training dataset and provides predictions for the test dataset. The training dataset is divided into two parts, train and validation. The model is trained on the training part and then makes predictions for the validation part. In this way the predictions can be evaluated, since the true labels are available for the validation part (which is not the case for the test dataset).
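Several of the attributes listed in the data collection step (for example Gender, Education, and self-employed) are categorical. CatBoost is documented to encode such columns with ordered target statistics: each row is encoded from the labels of the rows that precede it in a random permutation, as (sum of earlier targets in the same category + a × prior) / (number of earlier rows in the same category + a). The sketch below is a minimal pure-Python illustration of that idea and is not the authors' implementation. The function name `ordered_target_statistics`, the smoothing weight `a`, the choice of `prior`, and the toy rows are all assumptions introduced here for illustration.

```python
import random


def ordered_target_statistics(categories, targets, prior, a=1.0, seed=0):
    """Encode one categorical column with ordered target statistics.

    Each row is encoded using only the rows that precede it in a random
    permutation, so a row's own label never leaks into its encoding.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # the random permutation of the rows

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of the earlier targets in the same category.
        encoded[idx] = (s + a * prior) / (n + a)
        # Only now does this row's label become visible to later rows.
        sums[cat] = s + targets[idx]
        counts[cat] = n + 1
    return encoded


# Hypothetical toy column: education level and a 0/1 default label.
education = ["Graduate", "Not Graduate", "Graduate", "Graduate", "Not Graduate"]
defaulted = [0, 1, 0, 1, 1]
prior = sum(defaulted) / len(defaulted)  # overall default rate used as prior

encoded = ordered_target_statistics(education, defaulted, prior)
print(encoded)
```

The first row of each category in the permutation has no history, so it falls back to the prior; later rows move toward the observed default rate of their category. In practice the encoding would be computed on the training part only and then applied to the validation part.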

G. Implementation summary

The system design can be divided into a pre-processing stage, a model creation stage, and an analysis stage.

Step 1: Pre-processing. The pre-processing stage treats the customer request as a query and organizes the customer details as test data. The information contains the personal information and credential data of the customers. It includes the existing loans payable and existing assets. These data are formulated into keywords that match the loan criteria.

Step 2: Data splitting. The data is divided into training data and testing data. 80% of the data is used as training data and 20% as testing data.

Step 3: Applying the CatBoost regression algorithm. The CatBoost regression model is derived from the gradient boosting regression model. The input data under test is the categorical information received from the customer request. These data are compared with the training data, which contains various recorded historical information. These categorical features need to be processed accurately.

The CatBoost algorithm is selected because it helps analyse databases in which most of the features are categorical.

Mathematically, to estimate the target for the ith categorical value of the kth element, a random permutation is used, as defined in equation (1).

[Equation (1)]

The value used for the indicator takes the ith component of the CatBoost vector.

Step 4: Prediction. The next step considers the correlation between the test data and the complete data in the database. The prediction is obtained from the actual data and the predicted data. The predicted values give the eligibility of the consumer based on credit history.

Pseudocode

Algorithm: Catboost_regression_algorithm
Input: Credit_DB
Output: Eligibility_class
For i = 1 : N_Db_content
    For j = 1 : n_rows
        Split tr_data = Credit_DB; Ts_data = input_query
        X = corr_score(tr_data, Ts_data)
    End
End
M = Catboost(tr_data, Ts_data)
Test_dynamic = M(tr_data, input_query)
Prediction_score = Test_dynamic
End
Repeat

The various results obtained from the proposed analysis are discussed in section V.

V. RESULTS AND DISCUSSIONS

Fig 2. Raw data

Fig 2 shows the raw data collected from the PPDai dataset. The data contains 24 features.

Fig 3. Pre-processed data with attributes

Fig 3 shows the attribute extraction process, where various constraints are considered as the attributes.

Fig 4. Correlation metrics using confusion matrix

Fig 4 shows the correlation metrics using the confusion matrix, formulated here using the Recurrent CatBoost (RCB) model. In the proposed analysis, the creditworthiness of the consumer is derived from various attributes such as 90-day analysis, worst payback analysis, and late payment history. The system needs to be explored further by comparing various datasets with different default values.

VI. CONCLUSION

People who are approved for loans must meet stringent criteria
regarding their eligibility. A consumer's credit score determines whether or not they can get loans, and it remains a critical consideration when approving loans in the banking sector. The proposed system is built with existing methods in mind and provides an automated model that quickly determines whether a loan application meets the requirements. Using the consumer's past data, machine learning algorithms can help create, fit, and correlate credit pattern analyses. The proposed model uses the Recurrent CatBoost (RCB) algorithm and focuses on optimized weight balancing of the transaction history. The proposed method uses the PPDai defaulters dataset, which was gathered from publicly accessible websites; the data can be found in the UCI repository. The analysis uses default data consisting of 24 features and 55596 instances. The proposed model achieves a loss rate of 0.156, and the RCB model has a prediction accuracy of 88%.

REFERENCES

[1] M. Pincovsky, A. Falcão, W. N. Nunes, A. Paula Furtado and R. C. L. V. Cunha, "Machine Learning applied to credit analysis: a Systematic Literature Review," 2021 16th Iberian Conference on Information Systems and Technologies (CISTI), Chaves, Portugal, 2021, pp. 1-5, doi: 10.23919/CISTI52073.2021.9476350.

[2] A. Safiya Parvin and B. Saleena, "An Ensemble Classifier Model to Predict Credit Scoring - Comparative Analysis," 2020 IEEE International Symposium on Smart Electronic Systems (iSES) (Formerly iNiS), Chennai, India, 2020, pp. 27-30, doi: 10.1109/iSES50453.2020.00017.

[3] Al Aradi, M., & Hewahi, N. (2020, October). Prediction of stock price and direction using neural networks: Datasets hybrid modeling approach. In 2020 International Conference on Data Analytics for Business and Industry: Way Towards a Sustainable Economy (ICDABI) (pp. 1-6). IEEE.

[4] Park, M. S., Son, H., Hyun, C., & Hwang, H. J. (2021). Explainability of machine learning models for bankruptcy prediction. IEEE Access, 9, 124887-124899.

[5] Pesce, P., Menini, M., Ugo, G., Bagnasco, F., Dioguardi, M., & Troiano, G. (2022). Evaluation of periodontal indices among non-smokers, tobacco, and e-cigarette smokers: A systematic review and network meta-analysis. Clinical Oral Investigations, 26(7), 4701-4714.

[6] Al Mamun, Miraz, Afia Farjana, and Muntasir Mamun. "Predicting Bank Loan Eligibility Using Machine Learning Models and Comparison Analysis."

[7] Shiv, S. J., Murthy, S., & Challuru, K. (2018, December). Credit risk analysis using machine learning techniques. In 2018 Fourteenth International Conference on Information Processing (ICINPRO) (pp. 1-5). IEEE.

[8] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, "Explaining explanations: An overview of interpretability of machine learning," in Proc. IEEE 5th Int. Conf. Data Sci. Adv. Analytics (DSAA), Oct. 2020, pp. 80–89.

[9] D. V. Carvalho, E. M. Pereira, and J. S. Cardoso, "Machine learning interpretability: A survey on methods and metrics," Electronics, vol. 8, no. 8, p. 832, Jul. 2019.

[10] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why should I trust you?': Explaining the predictions of any classifier," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2020, pp. 1135–1144.

[11] M. Kim, J. Kim, and K. Park, "A study on financial characteristic of delisting companies by Kosdaq," Rev. Accounting Policy Stud., vol. 16, pp. 125–142, Mar. 2019.

[12] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, "LightGBM: A highly efficient gradient boosting decision tree," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 3146–3154.
