School of Information Technology and Engineering M.Tech Software Engineering (Integrated) FALL SEMESTER 2020 - 2021
Submitted by:
CONCEPT:
Using software metrics to build an automated, systematic loan
approval prediction system.
INTRODUCTION
• Loan distribution is the core business of almost every bank. The
main portion of a bank's assets comes directly from the profit
earned on the loans it distributes.
• The prime objective in a banking environment is to invest assets
in safe hands. Today many banks and financial companies approve
loans after a rigorous process of verification and validation, but
there is still no surety that the chosen applicant is the most
deserving of all applicants.
• Through this system we can predict whether a particular applicant
is safe or not, and the whole process of feature validation is
automated using machine learning techniques.
• The disadvantage of this model is that it assigns a different
weight to each factor, whereas in real life a loan can sometimes be
approved on the basis of a single strong factor alone, which is not
possible with this system.
• Loan prediction is helpful for bank employees as well as for
applicants. The aim of this paper is to provide a quick and easy
way to choose deserving applicants.
• It can provide special advantages to the bank. The Loan Prediction
System can automatically calculate the weight of each feature
taking part in loan processing, and on new test data the same
features are processed with respect to their associated weights.
• A time limit can be set for the applicant to check whether his/her
loan can be sanctioned or not. The Loan Prediction System allows
jumping to a specific application so that it can be checked on a
priority basis. This paper is exclusively for the managing
authority of the bank or finance company; the whole prediction
process is done privately, and no stakeholder is able to alter the
processing.
• The result for a particular Loan ID can be sent to the various
departments of the bank so that they can take appropriate action on
the application. This helps the other departments carry out their
formalities.
ABSTRACT
The company wants to automate the loan eligibility process in real
time, based on the customer details provided in the online
application form. These details include Gender, Marital Status,
Education, Number of Dependents, Income, Loan Amount, Credit History
and others. To automate this process, the company has posed the
problem of identifying the customer segments that are eligible for a
loan, so that it can specifically target these customers. A partial
data set has been provided.
RELATED WORKS
• Out of all the classification algorithms used on the Merged dataset,
Bagging Algorithm with ‘J48’ as its base classifier gives the best overall
prediction accuracy.
• ‘Loan_Duration’, ‘emp_length’ and ‘age’ are the most important factors
for predicting the class of the loan applicant (whether the applicant
would ‘default’ or ‘not’) in case of the Merged dataset.
• ‘zipcode’ and ‘interest rate’ are the most important factors for
predicting the class of the loan applicant (whether the applicant would
‘default’ or ‘not’) in case of the Lending Club dataset.
BARUN PAUDEL [5] It is deduced that cleaning the data and selecting
the most significant features for training a predictive model greatly
increases the accuracy of the model. Repayment rate can be predicted
with minimum RMSE value from the given dataset. The most influential
feature in the data is student demographics avg family income.
However, there are other important features, such as
student__demographics_race_ethnicity_black,
student__demographics_median_family_income, report_year and
school_state, which influence repayment_rate. Many other features
have little significance for the scored label. More effective tuning
of data cleaning and feature selection can help to reduce the RMSE
value to a desired threshold. Using a Boosted Decision Tree Regression
model with a lower learning rate and a higher number of decision trees
greatly increases the coverage and accuracy of the model.
K. Kala [7] Risk assessment is a crucial task in the banking industry.
This paper proposes a framework (CARE) for risk evaluation, in which a
mass volume of customer data is generated and risk assessment and
evaluation are done using data mining techniques. The customer data
are extracted for feature selection of the valuable attributes. The
attributes are selected using information gain theory. Rule prediction
is done for each loan type. Risk assessment is performed at two
levels, namely primary and secondary. Each risk level consists of
three attributes to be evaluated. The C4.5 algorithm is used to
classify the risk levels as low, medium and high, based on the
percentage of risk values obtained. A threshold value is set, so that
credit applicants below the threshold are rejected and the remaining
credits are sanctioned. The sanctioned and rejected credit applicants
are considered ‘Good’ and ‘Bad’ credits respectively.
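The attribute selection by information gain described above can be
illustrated with a minimal Python sketch. It is not the CARE
implementation: the attribute names, the toy applicants and the helper
functions are hypothetical and were invented for this example. The gain
of an attribute is the entropy of the class label minus the expected
entropy after splitting on that attribute.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy applicants; the values are invented for illustration only.
applicants = [
    {"credit_history": "good", "area": "urban", "status": "Good"},
    {"credit_history": "good", "area": "rural", "status": "Good"},
    {"credit_history": "bad", "area": "urban", "status": "Bad"},
    {"credit_history": "bad", "area": "rural", "status": "Bad"},
]

# Rank the candidate attributes from most to least informative.
gains = {a: information_gain(applicants, a, "status") for a in ("credit_history", "area")}
ranked = sorted(gains, key=gains.get, reverse=True)
print(gains, ranked)
```

In this toy data the first attribute separates the classes completely
while the second carries no information, so a selection step based on
information gain would keep the first and discard the second.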
ANALYSED FRAMEWORK
ARCHITECTURE DIAGRAM:
FRAMEWORK
FLOW OF INPUT DATA ILLUSTRATION THROUGH DATA
FLOW DIAGRAMS FOR LOAN APPROVAL AND PREDICTION
SYSTEM:
LEVEL 1 DFD
DATASET FEATURES
Dataset Description
No. of attributes:
• 11 input attributes + 1 output attribute
Input variables:
• Gender
• Married
• Dependents
• Education
• Self Employed
• Applicant Income
• Coapplicant Income
• Loan Amount
• Loan Amount Term
• Credit History
• Property Area
Output variable:
• Loan Status
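As a rough illustration of this layout, the Python sketch below turns
one raw record into the 11 input attributes plus the output attribute,
marking blank cells as missing. It is only a sketch under assumptions:
the report's analysis was done in R, the keys simply reuse the
attribute names listed above, the choice of which attributes are
numeric is assumed, and `parse_record` and the sample record are
hypothetical.

```python
# Attribute names as listed in this report; the real file headers may differ.
INPUTS = [
    "Gender", "Married", "Dependents", "Education", "Self Employed",
    "Applicant Income", "Coapplicant Income", "Loan Amount",
    "Loan Amount Term", "Credit History", "Property Area",
]
# Assumption: these attributes hold numbers, the rest hold categories.
NUMERIC = {
    "Applicant Income", "Coapplicant Income", "Loan Amount",
    "Loan Amount Term", "Credit History",
}
OUTPUT = "Loan Status"


def parse_record(raw):
    """Return (features, label); blank cells become None so they can be treated later."""
    features = {}
    for name in INPUTS:
        cell = raw.get(name, "").strip()
        if not cell:
            features[name] = None
        elif name in NUMERIC:
            features[name] = float(cell)
        else:
            features[name] = cell
    label = raw.get(OUTPUT, "").strip() or None
    return features, label


# Hypothetical record, invented for illustration; Loan Amount is left blank.
raw = {
    "Gender": "Male", "Married": "Yes", "Dependents": "3+",
    "Education": "Graduate", "Self Employed": "No",
    "Applicant Income": "5000", "Coapplicant Income": "0",
    "Loan Amount": "", "Loan Amount Term": "360",
    "Credit History": "1", "Property Area": "Urban", "Loan Status": "Y",
}
features, label = parse_record(raw)
missing = [name for name, value in features.items() if value is None]
print(label, missing)
```

Keeping missing cells as an explicit `None` makes it easy to count them
per attribute before deciding how to treat them.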
SAMPLE DATA SET OUTPUT:
SOFTWARE USED: R (PROGRAMMING TOOL)
MISSING VALUES TREATMENT
EXECUTING THE DATASET
DECISION TREE OF COMPLETE LOAN DATASET
EXECUTING THE DATASET USING ALL MODELS
ERROR MATRIX FOR ALL THE MODELS
INITIAL ERROR RATE GENERATED FOR COMPLETE LOAN
DATASET ACROSS DIFFERENT MODELS
Here the lowest error rate is produced by the Neural Net model. The
Neural Network is therefore the most suitable model for the initial
execution of the given dataset.
ERROR PERCENTAGE FOR SEVEN NODES
DECISION TREE WITH N NODES FOR COMPLETE DATASET:
ANALYSING DATA
BOX PLOT
Applicant Income
Co-applicant income
Loan amount
Scatter plot (Applicant Income and Loan Amount)
Statistical Analysis
PEARSON
SPEARMAN
In statistics, Spearman's rank correlation coefficient or Spearman's ρ,
named after Charles Spearman and often denoted by the Greek letter ρ
(rho) or as r_s, is a nonparametric measure of rank correlation. It
assesses how well the relationship between two variables can be
described using a monotonic function.
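The difference between the two coefficients can be seen in a minimal
Python sketch. It is an illustration only, not the R code used for the
report; the function names and the sample values are hypothetical.
Spearman's coefficient is computed here as the Pearson coefficient of
the ranks, with tied values sharing their average rank.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked data."""
    return pearson(ranks(x), ranks(y))


# Invented values: the second list is the cube of the first, so the
# relationship is monotonic but not linear.
income = [1, 2, 3, 4, 5]
loan = [1, 8, 27, 64, 125]
print(round(pearson(income, loan), 3), round(spearman(income, loan), 3))
```

Because the relationship is perfectly monotonic, the rank-based
coefficient reaches its maximum, while the Pearson coefficient stays
below it because the relationship is not a straight line.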
LOAN STATUS BY OTHER VARIABLES
RESULTS AND DISCUSSIONS
TESTED RESULTS:
After Clustering:
1. Loan Amount
2. Applicant Income
3. Coapplicant Income
Box plots were applied to the decision attributes, and the error rate
of each model was then measured:
For the Decision Tree Model:
Overall error: 10%, Averaged class error: 31.4%
For the Random Forest Model:
Overall error: 8%, Averaged class error: 25%
For the SVM Model:
Overall error: 9%, Averaged class error: 24.6%
For the Linear Model:
Overall error: 8.3%, Averaged class error: 24.6%
For the Neural Network Model:
Overall error: 8.7%, Averaged class error: 50%
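The two figures quoted for each model can diverge widely, and a short
Python sketch shows why. It assumes the usual definitions: overall
error is the share of all instances that are misclassified, and
averaged class error is the mean of the per-class error rates. The
report does not define them, and the function name and the counts below
are hypothetical.

```python
def error_rates(matrix):
    """matrix[actual][predicted] holds counts; returns (overall, averaged class) error."""
    labels = list(matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c].get(c, 0) for c in labels)
    overall = 1 - correct / total
    # Error rate within each actual class, then the unweighted mean.
    per_class = [1 - matrix[c].get(c, 0) / sum(matrix[c].values()) for c in labels]
    return overall, sum(per_class) / len(per_class)


# Invented counts: a model that answers "Y" for every applicant
# in a data set where 90 of 100 applicants are actually "Y".
always_yes = {
    "Y": {"Y": 90, "N": 0},
    "N": {"Y": 10, "N": 0},
}
overall, averaged = error_rates(always_yes)
print(f"overall={overall:.1%} averaged class={averaged:.1%}")
```

Under these assumed definitions, a two-class model that predicts the
same class for every applicant has an averaged class error of exactly
50% however low its overall error is. The 50% reported for the Neural
Network would be consistent with that behaviour, although the report
does not confirm it.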
The final number of instances after performing box plot on the dataset is
813.
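The report does not state the rule by which the box plots were used to
filter instances. A common convention, assumed here, is to drop values
lying more than 1.5 times the interquartile range beyond the quartiles,
which is where the whiskers of a standard box plot end. The Python
sketch below shows that rule on invented values. The function and
column names are hypothetical, and the quartile method of Python's
`statistics.quantiles` differs slightly from the hinges that R's
`boxplot` uses.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: quartiles pushed out by k times the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(rows, column):
    """Keep only the rows whose value in `column` lies inside the fences."""
    low, high = iqr_bounds([row[column] for row in rows])
    return [row for row in rows if low <= row[column] <= high]


# Invented incomes for illustration; the last one is far from the rest.
rows = [{"applicant_income": v}
        for v in (2500, 3000, 3200, 3500, 4000, 4200, 4500, 81000)]
kept = drop_outliers(rows, "applicant_income")
print(len(rows), "->", len(kept))
```

Applying such a filter to each decision attribute in turn reduces the
number of instances, which is one way a smaller data set could result.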
The error rate generated after applying box plot based on the attributes:
Random Forest Model:
Random Forest is a flexible, easy-to-use machine learning algorithm
that often produces good results even without hyper-parameter tuning.
It is also one of the most widely used algorithms, because of its
simplicity and because it can be used for both classification and
regression tasks.
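The ensemble idea behind a random forest can be sketched in a few lines
of Python. The sketch is a deliberate simplification and not the model
used in this report: it bags one-split trees (stumps) on a single
numeric feature and omits the random feature selection and full-depth
trees of a real random forest. All names and values are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Best single-threshold rule on one feature: (threshold, left label, right label)."""
    majority = Counter(ys).most_common(1)[0][0]
    best = (float("inf"), majority, majority)  # fallback: always predict the majority
    best_errors = sum(y != majority for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best_errors:
            best, best_errors = (threshold, left_label, right_label), errors
    return best


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in picks], [ys[i] for i in picks]))
    return forest


def predict(forest, x):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Invented, clearly separable scores: low values rejected, high values approved.
scores = [1, 2, 3, 4, 10, 11, 12, 13]
status = ["N", "N", "N", "N", "Y", "Y", "Y", "Y"]
forest = fit_bagged_stumps(scores, status)
print(predict(forest, 2), predict(forest, 12))
```

Each stump sees a slightly different sample, so individual stumps may
disagree, and the majority vote smooths out those differences. A random
forest additionally restricts every split to a random subset of the
features.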
CONCLUSION
1. Applicants not having a credit history are most likely to get their
loan approved.
2. Applicants with higher applicant income are most likely to get their
loan approved.
REFERENCES
• https://github.com/shrikant-temburwar/Loan-Prediction-Dataset
• https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
• https://www.datamentor.io/r-programming/
• http://www.alcula.com/calculators/statistics/box-plot/
• http://www.alcula.com/calculators/statistics/scatter-plot/
• David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent
Dirichlet allocation. Journal of Machine Learning Research 3, Jan
(2003), 993--1022.
• Nicoleta Caragea, Antoniade-Ciprian Alexandra, Ana Maria
Dobre, et al. 2014. R-a Global Sensation in Data Science.
• https://spectrum.ieee.org/computing/software/the-2017-top-programming-languages
• Tse-Hsun Chen, Stephen W Thomas, and Ahmed E Hassan. 2016.
A survey on the use of topic models when mining software
repositories. Empirical Software Engineering 21, 5 (2016).
• Steven Andrew Culpepper and Herman Aguinis. 2011. R is for
revolution: A cutting-edge, free, open source statistical package.
• Erik Linstead, Lindsey Hughes, Cristina Lopes, and Pierre
Baldi. 2009.
• R Development Core Team. 2008. R: A Language and
Environment for Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. http://www.R-project.org
ISBN 3-900051-07-0.
• Stephen W Thomas, Bram Adams, Ahmed E Hassan, and
Dorothea Blostein. 2010.
Journal Articles