
School of Information Technology and Engineering M.Tech Software Engineering (Integrated) FALL SEMESTER 2020 - 2021

This document summarizes a student project on predicting loan approvals using software metrics and machine learning techniques. It introduces the problem of predicting whether a loan applicant will repay their loan or default. Several machine learning algorithms are tested on loan data, including logistic regression, decision trees, and random forest models. The decision tree algorithm achieved the best accuracy. Literature reviews summarize several other papers applying techniques like random forest, SVM, neural networks to loan approval prediction using Lending Club and other loan data.

Uploaded by

Srikar Satya

SCHOOL OF INFORMATION TECHNOLOGY AND ENGINEERING
M.TECH SOFTWARE ENGINEERING (INTEGRATED)
FALL SEMESTER 2020 – 2021

SWE2020 - Software Metrics


Slot: G1

Topic: Loan Approval Prediction


Specific Field of research: Banking Sector

Submitted by:

18MIS0337 – JANAGAM MRUNAL REDDY


18MIS0369 – SRIKAR KOTRA
18MIS0370 – UMMADI VENKATA MOHAN KUMAR
18MIS0371 – TURAKA BHAVITEJ

CONCEPT:
Using software metrics to build an automated, systematic loan approval
prediction system.

INTRODUCTION
• Loan distribution is the core business of almost every bank. The main
portion of a bank's assets comes directly from the profit earned on
the loans it distributes.
• The prime objective in a banking environment is to place assets in
safe hands. Today many banks and financial companies approve loans
after a rigorous process of verification and validation, yet there is
still no assurance that the chosen applicant is the most deserving of
all applicants.
• Through this system we can predict whether a particular applicant is
safe or not, and the whole process of validating features is
automated using machine learning techniques.
• A disadvantage of this model is that it assigns a weight to each
factor, whereas in real life a loan can sometimes be approved on the
basis of a single strong factor alone, which is not possible through
this system.
• Loan prediction is helpful for bank employees as well as for
applicants. The aim of this paper is to provide a quick and easy way
to choose the deserving applicants.
• It can provide special advantages to the bank. The Loan Prediction
System can automatically calculate the weight of each feature taking
part in loan processing, and on new test data the same features are
processed with respect to their associated weights.
• A time limit can be set for the applicant to check whether his/her
loan can be sanctioned or not. The Loan Prediction System allows
jumping to a specific application so that it can be checked on a
priority basis. This paper is exclusively for the managing authority
of the bank or finance company; the whole prediction process is done
privately, and no stakeholder is able to alter the processing.
• The result for a particular Loan ID can be sent to the various
departments of the bank so that they can take appropriate action on
the application. This helps the other departments carry out their
formalities.

ABSTRACT

Loan approval is a very important process for banking organizations.
The system approves or rejects loan applications. Recovery of loans is
a major contributing parameter in the financial statements of a bank.
It is very difficult to predict the possibility of repayment of a loan
by the customer. In recent years many researchers have worked on loan
approval prediction systems. Machine Learning (ML) techniques are very
useful in predicting outcomes for large amounts of data. In this paper
three machine learning algorithms, Logistic Regression (LR), Decision
Tree (DT) and Random Forest (RF), are applied to predict the loan
approval of customers. The experimental results conclude that the
accuracy of the Decision Tree algorithm is better than that of the
Logistic Regression and Random Forest approaches.

PROBLEM STATEMENT IN EXISTING SYSTEM:

Dream Housing Finance company deals in all home loans. It has a
presence across urban, semi-urban and rural areas. A customer first
applies for a home loan, after which the company validates the
customer's eligibility for the loan.

The company wants to automate the loan eligibility process (in real
time) based on the customer details provided while filling in the
online application form. These details are Gender, Marital Status,
Education, Number of Dependents, Income, Loan Amount, Credit History
and others. To automate this process, the company has posed the problem
of identifying the customer segments that are eligible for a loan
amount, so that it can specifically target these customers. A partial
data set has been provided.

VARIOUS APPROACHES USED:

Decision Tree model: A model that reaches a prediction through a
sequence of branching operations, each based on a comparison of some
quantity (here, an applicant attribute) against a value.
Random Forest Model: An ensemble learning method for classification,
regression and other tasks that operates by constructing a multitude of
decision trees at training time.
SVM (Support Vector Machine Model): Support-vector machines (SVMs, also
support-vector networks) are supervised learning models with associated
learning algorithms that analyze data for classification and regression
analysis.
Linear Model: Linear models describe a continuous response variable as
a function of one or more predictor variables. They can help you
understand and predict the behaviour of complex systems or analyse
experimental, financial, and biological data.
Neural network model: Artificial neural networks are forecasting
methods that are based on simple mathematical models of the brain.
They allow complex nonlinear relationships between the response
variable and its predictors.
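
To make the idea of a "branching operation based on a comparison" concrete, the following Python sketch finds the single best threshold test on one attribute using Gini impurity. It is only an illustration: the report itself used R, the impurity measure is an assumption (the report does not say which criterion its tree used), and the toy data and the names gini and best_split are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' test."""
    best = (None, float("inf"))
    # Splitting at the largest value would leave the right branch empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: credit-history flag and loan outcome.
credit_history = [1, 1, 1, 0, 0, 1, 0, 1]
loan_status = ["Y", "Y", "Y", "N", "N", "Y", "N", "Y"]
threshold, impurity = best_split(credit_history, loan_status)
print(threshold, impurity)
```

A full decision tree repeats this search recursively on each branch until the branches are pure or a stopping rule is met.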

RELATED WORKS

LITERATURE REVIEW SUMMARY OF DIFFERENT RESEARCH PAPERS

Zakaria Alomari, Dmitriy Fingerman [1] This paper proposes a solution
for predicting whether a peer-to-peer loan at Lending Club will be paid
off or will default. The methodology consisted of the following main
stages: Data Exploration, which is learning the properties of the
dataset; Data Preprocessing, which is preparing the data for analysis;
and Classification, which included a long list of experiments with
various classification algorithms and tuning parameters in order to
find the most effective classification model. The most effective model
was achieved using Random Forest, with an accuracy of 71.75%. The paper
also proposes a solution for discovering interesting relations between
attributes of Lending Club loan applications that are not immediately
evident. This is done using the Apriori association rule mining
algorithm. The paper presents a list of selected interesting
associations that were discovered as part of this task.
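
Association rules of the kind mined in [1] are usually ranked by support and confidence. The Python sketch below computes only these two measures for a single rule; it is not the Apriori algorithm itself, which additionally generates candidate itemsets level by level. The transactions and item names are hypothetical and do not come from the Lending Club data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical loan records encoded as sets of attribute=value items.
transactions = [
    {"grade=A", "term=36", "status=paid"},
    {"grade=A", "term=36", "status=paid"},
    {"grade=C", "term=60", "status=default"},
    {"grade=A", "term=60", "status=paid"},
    {"grade=C", "term=36", "status=default"},
]

print(support(transactions, {"grade=A"}))
print(confidence(transactions, {"grade=A"}, {"status=paid"}))
```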
 
Ashish Pandit [2] Although the overall prediction accuracy is good for
both datasets, the prediction accuracy for defaulter instances is not
as good with any of the algorithms. The major reason for this could be
class imbalance, i.e. a high number of instances having the class ‘not
defaulters’, which results in biased output.
• The prediction accuracy for defaulter instances obtained using Cost
Sensitive Learning is considerably better than the results obtained
without it. The overall classification results also seem to be
relatively balanced.
• Of all the classification algorithms used on the Lending Club
dataset, the AdaBoostM1 algorithm with ‘Decision Stump’ as its base
classifier gives the best overall prediction accuracy.
• Of all the classification algorithms used on the Merged dataset, the
Bagging algorithm with ‘J48’ as its base classifier gives the best
overall prediction accuracy.
• ‘Loan_Duration’, ‘emp_length’ and ‘age’ are the most important
factors for predicting the class of the loan applicant (whether the
applicant would ‘default’ or ‘not’) in the case of the Merged dataset.
• ‘zipcode’ and ‘interest rate’ are the most important factors for
predicting the class of the loan applicant (whether the applicant would
‘default’ or ‘not’) in the case of the Lending Club dataset.
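
One way cost-sensitive learning counters class imbalance is by choosing the prediction with the lowest expected cost rather than the most probable class. The Python sketch below shows that decision rule in isolation. The cost values and the function name are hypothetical, and [2] may have used a different mechanism (for example instance reweighting), so this is an illustration of the idea rather than the paper's method.

```python
def predict_default(p_default, cost_missed_default=5.0, cost_false_alarm=1.0):
    """Flag a loan as 'default' when flagging has the lower expected cost.

    p_default is a model's estimated probability of default (assumed given).
    The two costs are hypothetical penalties for each kind of mistake.
    """
    expected_cost_if_approved = p_default * cost_missed_default
    expected_cost_if_flagged = (1.0 - p_default) * cost_false_alarm
    return expected_cost_if_flagged < expected_cost_if_approved

# With a 5:1 cost ratio the effective threshold drops from 0.5 to 1/6,
# so more of the rare defaulter class is caught.
print(predict_default(0.3))             # costs 5:1
print(predict_default(0.3, 1.0, 1.0))   # equal costs
```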

Aboobyda Jafar Hamid, Tarig Mohammed Ahmed [3] In this paper, three
algorithms (j48, bayesNet and naiveBayes) were used to build predictive
models that classify the loan applications submitted by customers as
good or bad loans by investigating customer behavior and previous
repayment of credit. The models were implemented using the Weka
application. After applying these classification algorithms, the
authors find that the best algorithm for loan classification is j48,
because it has high accuracy and low mean absolute error.

Sudhakar M, Dr. C. V. K Reddy [4] This paper presents a two-step loan
credibility prediction system that helps organizations make the right
decision to approve or reject customers' loan requests. This will help
the banking industry open up efficient delivery channels. The Decision
Tree Induction Algorithm is used for the prediction. Other techniques
that outperform popular data mining models have yet to be implemented
and tested for the domain. Data mining is the process of extracting
knowledge from existing data. It is used as a tool in banking and
finance in general to discover useful information. Credit risk
management is critical for successful bank lending. The authors attempt
to model the loan approval process at one of India’s midsized banks and
obtain statistically significant linear and nonlinear models to do so.
A two-step, or combined, credit scoring model is very useful and
accurately classifies loan applications using traditional credit
scoring and improved behavior scoring. This model is useful in decision
making for approving loan applications for existing and new customers.
The authors propose to extend the two-step credit approval model in
future research by including collateral, capacity and cash-flow
parameters.

BARUN PAUDEL [5] It is deduced that cleaning the data and selecting the
most significant features for training a predictive model greatly
increases the accuracy of the model. Repayment rate can be predicted
with a minimum RMSE value from the given dataset. The most influential
feature in the data is student demographics avg family income. However,
there are other important features that influence repayment_rate:
student__demographics_race_ethnicity_black,
student__demographics_median_family_income, report_year, school_state,
etc. Many other features have less significance for the scored label.
More effective tuning in data cleaning and feature selection can help
reduce the RMSE value to a desired threshold. Using a Boosted Decision
Tree Regression model with a lower learning rate and a higher number of
decision trees greatly increases the coverage and accuracy of the
model.
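
RMSE, the measure used in [5], is the square root of the mean squared difference between actual and predicted values. A minimal Python sketch follows; the repayment-rate values are hypothetical and are not taken from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical repayment rates and model predictions.
actual = [0.80, 0.60, 0.90, 0.70]
predicted = [0.75, 0.65, 0.85, 0.80]
print(round(rmse(actual, predicted), 4))
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones.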

Emile J. Salame [6] The outcome of this study can be useful to a
financial institution for measuring portfolio risk, that is, the
expected loss given default. It can also be useful for setting the
appropriate interest rate, for declining undesirable loans, and for
offering loans that are profitable. The estimated models could have
been used to estimate the probability of default of loans that were
rejected by the financial institution, had data on rejected loans been
available.

K. Kala [7] Risk assessment is a crucial task in the banking industry.
This paper proposes a framework (CARE) for risk evaluation, in which a
large volume of customer data is generated and risk assessment and
evaluation are performed using data mining techniques. The customer
data are extracted for feature selection of the valuable attributes.
The attributes are selected using information gain theory. Rule
prediction is done for each loan type. Risk assessment is performed at
two levels, namely primary and secondary. Each risk level consists of
three attributes to be evaluated. The C4.5 algorithm is used to
classify the risk levels as low, medium and high, based on the
percentage of risk values obtained. A threshold value is set, so that
credit applicants below the threshold are rejected and the remaining
credits are sanctioned. The sanctioned and rejected credit applicants
are considered ‘good’ and ‘bad’ credits respectively.
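
Information gain, the selection criterion mentioned for [7], measures how much knowing an attribute reduces the entropy of the class label. The Python sketch below computes it for categorical attributes. The attribute names and values are hypothetical and are not taken from the CARE framework.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data.
risk = ["good", "good", "bad", "bad"]
has_collateral = ["yes", "yes", "no", "no"]   # separates the classes perfectly
loan_type = ["home", "car", "home", "car"]    # carries no information

print(information_gain(has_collateral, risk))
print(information_gain(loan_type, risk))
```

Attributes with higher gain are kept; C4.5 itself uses a normalised variant (gain ratio) when choosing splits.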

Gusti Ngurah Narindra Mandala, Catharina Badra Nawangpalupi [8] Based
on the model built, it is confirmed that collateral value is the most
important criterion in credit assessment, as shown by collateral value
being the root node. The proposed model has credit period as the first
leaf node, followed by collateral value. Although there are many other
variables in the credit assessment criteria, the model shows that few
of them are relevant. The model was evaluated using 84% of the 1028
records as evaluation data. In terms of loan values, the model has also
improved the value of non-performing loans. Table 1 of the paper
compares the predicted performing loans based on current decision
making with those of the proposed data mining model. The model has
shown a good result, and PT BPR X is advised to apply it in its credit
assessment process.

Kalyani R. Rawate, Prof. P. A. Tijare [9] This application can help
banks predict the future status of a loan, and depending on that they
can take action in the initial days of the loan. Using this application
banks can reduce the number of bad loans and avoid incurring severe
losses. Several R functions and packages were used to prepare the data
and to build the classification model. R package libraries help in
successful data analysis and feature selection. Using this methodology
a bank can easily identify the required information from huge data
sets, which helps in successful loan prediction and reduces the number
of bad loans. Data mining techniques are very useful to the banking
sector for better targeting and acquisition of new customers, retention
of the most valuable customers, automatic credit approval used for
fraud prevention, real-time fraud detection, segment-based products,
analysis of customers' transaction patterns over time for better
retention and relationships, risk management and marketing.

Sivasree M S, Rekha Sunny T [10] This paper presents a loan credibility
prediction system that helps organizations make the right decision to
approve or reject customers' loan requests. This will help the banking
industry open up efficient delivery channels. The Decision Tree
Induction Algorithm is used for the prediction. Other techniques that
outperform popular data mining models have yet to be implemented and
tested for the domain.

ANALYSED FRAMEWORK

ARCHITECTURE DIAGRAM:

FRAMEWORK

FLOW OF INPUT DATA ILLUSTRATION THROUGH DATA
FLOW DIAGRAMS FOR LOAN APPROVAL AND PREDICTION
SYSTEM:

DATAFLOW INPUT&OUTPUT : LEVEL 0 DIAGRAM

LEVEL 1 DFD

DATASET FEATURES

• DATASET NAME : Loan Approval Prediction 


• DATASET URL: https://github.com/shri1407/Loan-Prediction-Dataset
• NO. OF INSTANCES: 982 
• NO. OF ATTRIBUTES: 13 
• YEAR: 2018 

Dataset Description

No. of attributes:
• 11+ 1 output attribute

Input variables:
• Gender
• Married
• Dependents
• Education
• Self Employed
• Applicant Income
• Coapplicant Income
• Loan Amount
• Loan Amount Term
• Credit History
• Property Area

Output Variables:
• Loan Status

SAMPLE DATA SET OUTPUT:

SOFTWARE USED: R (PROGRAMMING TOOL)

PREPROCESSING THE DATA: The dataset is loaded as a data frame, assuming
that the current working directory is the directory where the dataset
is stored.
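
The report performed this step in R. As a rough Python equivalent, the sketch below reads CSV rows into dictionaries and marks empty cells as missing. The column names, the sample rows and the file name train.csv in the comment are assumptions made for illustration; they are not taken from the report.

```python
import csv
import io

# Hypothetical two-row excerpt; real header names and values may differ.
SAMPLE = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Credit_History,Loan_Status
A001,Male,No,5800,,1,Y
A002,Male,Yes,4600,128,1,N
"""

def load_rows(text_stream):
    """Read CSV rows into dicts, turning empty strings into None."""
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in csv.DictReader(text_stream)
    ]

rows = load_rows(io.StringIO(SAMPLE))
print(len(rows), rows[0]["LoanAmount"])

# With a file in the current working directory (hypothetical name):
#   with open("train.csv", newline="") as f:
#       rows = load_rows(f)
```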

TRAINING DATA SUMMARY

MISSING VALUES TREATMENT 
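
The exact treatment used in the report is not described in the text. One common approach, shown here purely as an assumption, is to fill missing numeric values with the column median and missing categorical values with the most frequent value. The column names and rows in this Python sketch are hypothetical.

```python
from statistics import median, mode

def impute(rows, column, numeric):
    """Fill None entries of one column in place and return the fill value.

    Uses the median for numeric columns and the mode for categorical ones.
    """
    present = [r[column] for r in rows if r[column] is not None]
    fill = median(present) if numeric else mode(present)
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return fill

# Hypothetical rows with one gap in each column.
rows = [
    {"LoanAmount": 100.0, "Gender": "Male"},
    {"LoanAmount": None, "Gender": "Female"},
    {"LoanAmount": 140.0, "Gender": None},
    {"LoanAmount": 120.0, "Gender": "Male"},
]
impute(rows, "LoanAmount", numeric=True)
impute(rows, "Gender", numeric=False)
print(rows[1]["LoanAmount"], rows[2]["Gender"])
```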

EXECUTING THE DATASET

DECISION TREE OF COMPLETE LOAN DATASET

EXECUTING THE DATASET USING ALL MODELS

ERROR MATRIX FOR ALL THE MODELS

INITIAL ERROR RATE GENERATED FOR COMPLETE LOAN
DATASET ACROSS DIFFERENT MODELS

Here the lowest error rate is generated by the Neural Network model.
Therefore, on the complete dataset, the most suitable model at this
initial stage is the Neural Network.

ERROR PERCENTAGE FOR SEVEN NODES

SAMPLE OUTPUT FOR NODE 1

DECISION TREE WITH N NO.OF NODES FOR COMPLETE
DATASET :

RESULTS OF PRODUCT WITH THE HELP OF METRICS

ANALYSING DATA

A boxplot is a standardized way of displaying the distribution of data
based on a five-number summary ("minimum", first quartile (Q1), median,
third quartile (Q3), and "maximum"). It can tell you about your
outliers and what their values are. It can also tell you whether your
data is symmetrical, how tightly your data is grouped, and if and how
your data is skewed.
• Median (Q2/50th percentile): the middle value of the dataset.
• First quartile (Q1/25th percentile): the middle number between the
smallest number (not the "minimum") and the median of the dataset.
• Third quartile (Q3/75th percentile): the middle value between the
median and the highest value (not the "maximum") of the dataset.
• Interquartile range (IQR): 25th to the 75th percentile.
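
The quantities above can be computed directly. The Python sketch below uses the standard library's quantiles function with the "inclusive" method; quartile conventions differ between tools, so the numbers may not match a given plotting package exactly. The fences at 1.5 × IQR are the usual box-plot convention and correspond to the quoted "minimum" and "maximum"; the function name and data are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR and 1.5*IQR fences for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "lower_fence": q1 - 1.5 * iqr,   # the box plot's "minimum"
        "q1": q1,
        "median": q2,
        "q3": q3,
        "upper_fence": q3 + 1.5 * iqr,   # the box plot's "maximum"
        "iqr": iqr,
    }

summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

Points outside the two fences are the ones a box plot draws as individual outliers.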

BOX PLOT

Applicant Income

Co-applicant income

Loan amount

Scatter plot(Applicant Income and Loan Amount)

A scatter plot is a type of plot or mathematical diagram that uses
Cartesian coordinates to display values for, typically, two variables
in a set of data. If the points are coded, one additional variable can
be displayed.

Statistical Analysis

A correlation coefficient is a numerical measure of some type of
correlation, meaning a statistical relationship between two variables.
The variables may be two columns of a given data set of observations,
often called a sample, or two components of a multivariate random
variable with a known distribution.

PEARSON

In statistics, the Pearson correlation coefficient, also referred to as
Pearson's r, the Pearson product-moment correlation coefficient, or the
bivariate correlation, is a statistic that measures linear correlation
between two variables X and Y. It has a value between +1 and −1.
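
Pearson's r is the covariance of the two variables divided by the product of their standard deviations. A minimal Python sketch of that definition follows; the income and loan figures are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Hypothetical values with an exact linear relationship.
income = [2000, 3000, 4000, 5000]
loan = [50, 70, 90, 110]
print(pearson_r(income, loan))
```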

SPEARMAN

In statistics, Spearman's rank correlation coefficient or Spearman's ρ,
named after Charles Spearman and often denoted by the Greek letter ρ
(rho) or as r_s, is a nonparametric measure of rank correlation. It
assesses how well the relationship between two variables can be
described using a monotonic function.
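
Spearman's coefficient can be obtained by replacing each value with its rank and then computing Pearson's r on the ranks. The Python sketch below does this, giving tied values the average of their ranks; the function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# A monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y), pearson_r(x, y))
```

On this data Spearman's coefficient is 1 because the relationship is perfectly monotonic, while Pearson's r is below 1 because it is not linear.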

LOAN STATUS BY OTHER VARIABLES

RESULTS AND DISCUSSIONS

TESTED RESULTS:

After Clustering: 

Error Rate Before Clustering: 11.7% 


Error Rate After Clustering: 8.29% 

The Decisional Attributes are : 

1. Loan Amount 
2. Applicant Income 
3. Coapplicant Income 

Number of Instances after Filtering: 


Loan Amount – 890 
Applicant Income – 868 
Coapplicant Income – 956 
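
The report does not state the exact filtering rule, so the following Python sketch assumes the common box-plot convention: keep only rows whose value lies within 1.5 × IQR of the quartiles. The column name, the multiplier k and the sample values are assumptions.

```python
from statistics import quantiles

def iqr_filter(rows, column, k=1.5):
    """Keep rows whose value in `column` lies within the k*IQR fences."""
    values = [r[column] for r in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[column] <= high]

# Hypothetical loan amounts with one extreme value.
rows = [{"LoanAmount": v} for v in [100, 110, 120, 125, 130, 140, 150, 700]]
kept = iqr_filter(rows, "LoanAmount")
print(len(rows), len(kept))
```

Applying such a filter to each decisional attribute in turn shrinks the dataset, which is consistent with the per-attribute instance counts reported above.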

Error rates of each model after applying the box plot to the decisional
attributes:
• Decision Tree Model: overall error 10%, averaged class error 31.4%
• Random Forest Model: overall error 8%, averaged class error 25%
• SVM Model: overall error 9%, averaged class error 24.6%
• Linear Model: overall error 8.3%, averaged class error 24.6%
• Neural Network Model: overall error 8.7%, averaged class error 50%
The final number of instances after performing the box plot on the
dataset is 813.
The error rate generated after applying the box plot based on the
attributes:

The lowest overall error rate, 8%, is generated by the Random Forest
model. Hence the best model that can be chosen is Random Forest.
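
The two reported measures can be derived from an error (confusion) matrix. The Python sketch below assumes that "averaged class error" means the mean of the per-class error rates; the confusion-matrix counts are hypothetical, while the overall error percentages in the final dictionary are the ones reported in this section.

```python
def error_rates(confusion):
    """confusion[actual][predicted] -> count.

    Returns (overall error, averaged class error), both as fractions.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    wrong = sum(n for a, row in confusion.items() for p, n in row.items() if p != a)
    class_errors = [
        1 - row.get(a, 0) / sum(row.values()) for a, row in confusion.items()
    ]
    return wrong / total, sum(class_errors) / len(class_errors)

# Hypothetical counts: a model that labels every applicant "Y".
always_yes = {"Y": {"Y": 92, "N": 0}, "N": {"Y": 8, "N": 0}}
overall, averaged = error_rates(always_yes)
print(overall, averaged)

# Overall error (%) per model as reported in this section.
reported_overall = {
    "Decision Tree": 10.0,
    "Random Forest": 8.0,
    "SVM": 9.0,
    "Linear": 8.3,
    "Neural Network": 8.7,
}
best = min(reported_overall, key=reported_overall.get)
print(best)
```

The hypothetical example shows why both measures matter: on imbalanced data a model can reach a low overall error while its averaged class error sits at 50% because one class is never predicted correctly.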

Random Forest Model:
Random Forest is a flexible, easy-to-use machine learning algorithm
that produces a good result most of the time, even without
hyperparameter tuning. It is also one of the most widely used
algorithms because of its simplicity and the fact that it can be used
for both classification and regression tasks.

Advantages and Disadvantages:

An advantage of Random Forest is that it can be used for both
regression and classification tasks, and that it is easy to view the
relative importance it assigns to the input features.
Random Forest is also considered a very handy and easy-to-use
algorithm, because its default hyperparameters often produce a good
prediction result. The number of hyperparameters is also not that high
and they are straightforward to understand.
One of the big problems in machine learning is overfitting, but most of
the time this does not happen easily to a random forest classifier. If
there are enough trees in the forest, the classifier is unlikely to
overfit the model.
The main limitation of Random Forest is that a large number of trees
can make the algorithm too slow and ineffective for real-time
predictions. In general, these algorithms are fast to train, but quite
slow to create predictions once they are trained. A more accurate
prediction requires more trees, which results in a slower model. In
most real-world applications the random forest algorithm is fast
enough, but there can certainly be situations where run-time
performance is important and other approaches would be preferred.
Finally, Random Forest is a predictive modeling tool and not a
descriptive tool. If you are looking for a description of the
relationships in your data, other approaches would be preferred.
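
The trade-off between the number of trees and prediction time follows from how the ensemble works: every tree must be evaluated for every prediction. The Python sketch below illustrates only the bagging and majority-vote parts of the method, using one-level trees (stumps) on a single hypothetical attribute. A real random forest also grows deeper trees and samples a random subset of features at each split, which this simplified sketch omits.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold t that minimises errors of 'predict 1 if x > t'."""
    best_t, best_err = None, None
    for t in sorted({x for x, _ in points}):
        err = sum((x > t) != bool(y) for x, y in points)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def fit_forest(points, n_trees, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)
    ]

def predict(forest, x):
    """Majority vote over all stumps; cost grows with the number of stumps."""
    votes = Counter(int(x > t) for t in forest)
    return votes.most_common(1)[0][0]

# Hypothetical (score, repaid?) pairs.
data = [(300, 0), (350, 0), (400, 0), (450, 0),
        (600, 1), (650, 1), (700, 1), (750, 1)]
forest = fit_forest(data, n_trees=25)
print(predict(forest, 800), predict(forest, 200))
```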

CONCLUSION

1. Applicants not having a credit history are most likely to get their
loan approved.

2. Applicants with higher applicant income are most likely to get their
loan approved.

3. Loans for properties in urban areas with high growth prospects are
most likely to be approved.

4. Using a more sophisticated model does not guarantee better results.

5. Although accuracy is reduced with Random Forest, the
cross-validation score improves, showing that the model generalizes
well. Hence Random Forest is best for this problem.
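
A cross-validation score is the average performance over several train/test splits of the same data. The Python sketch below shows k-fold cross-validation with contiguous folds. The majority-class "model" is a hypothetical stand-in for whichever classifier is being evaluated, and the labels are made up for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_val_accuracy(xs, ys, k, fit, predict):
    """Mean accuracy over k folds; fit and predict are supplied by the caller."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(model, xs[i]) == ys[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the most common training label.
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)

def predict_majority(model, x):
    return model

labels = ["Y", "Y", "N", "Y", "Y", "N", "Y", "Y", "Y", "N"]
score = cross_val_accuracy(list(range(10)), labels, k=5,
                           fit=fit_majority, predict=predict_majority)
print(score)
```

Because every instance is used for testing exactly once, the averaged score is less sensitive to a lucky or unlucky single split than one hold-out accuracy figure.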

REFERENCES

• https://github.com/shrikant-temburwar/Loan-Prediction-Dataset
• https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
• https://www.datamentor.io/r-programming/
• http://www.alcula.com/calculators/statistics/box-plot/
• http://www.alcula.com/calculators/statistics/scatter-plot/
• David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent
Dirichlet allocation. Journal of Machine Learning Research 3, Jan
(2003), 993-1022.
• Nicoleta Caragea, Antoniade-Ciprian Alexandra, Ana Maria
Dobre, et al. 2014. R-a Global Sensation in Data Science.
• https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages

• Tse-Hsun Chen, Stephen W Thomas, and Ahmed E Hassan. 2016.
A survey on the use of topic models when mining software
repositories. Empirical Software Engineering 21, 5 (2016).
• Steven Andrew Culpepper and Herman Aguinis. 2011. R is for
revolution: A cutting-edge, free, open source statistical package.
• Erik Linstead, Lindsey Hughes, Cristina Lopes, and Pierre
Baldi. 2009.
• R Development Core Team. 2008. R: A Language and
Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. http://www.R-project.org
ISBN 3-900051-07-0.
• Stephen W Thomas, Bram Adams, Ahmed E Hassan, and
Dorothea Blostein. 2010.

Journal Articles

• [1] Alomari, Z., & Fingerman, D. Loan Default Prediction and
Identification of Interesting Relations between Attributes of
Peer-to-Peer Loan Applications.
• [2] Ashish Pandit. Data mining on loan approved dataset for
predicting defaulters.
• [3] Hamid, A. J., & Ahmed, T. M. (2016). Developing prediction
model of loan risk in banks using data mining. Machine Learning
and Applications: An International Journal (MLAIJ), 3(1).
• [4] Sudhakar, M., & Reddy, C. V. K. (2016). Two step credit risk
assessment model for retail bank loan applications using Decision
Tree data mining technique. International Journal of Advanced
Research in Computer Engineering & Technology (IJARCET),
5(3), 705-718.
• [5] Barun Paudel. Student Loan Repayment Prediction.
• [6] Salame, E. (2011). Applying data mining techniques to
evaluate applications for agricultural loans.
• [7] K. Kala. A Customized Approach for Risk Evaluation and
Prediction based on Data Mining Technique.
• [8] Mandala, I. G. N. N., Nawangpalupi, C. B., & Praktikto, F. R.
(2012). Assessing credit risk: An application of data mining in a
rural bank. Procedia Economics and Finance, 4, 406-412.
• [9] Kalyani R. Rawate, Prof. P. A. Tijare. Review on prediction
system for bank loan credibility.
• [10] Sivasree M S, Rekha Sunny T. Loan Credibility Prediction
System Based on Decision Tree Algorithm.
